PDF stats

August 26, 2010

Follow feliam on Twitter

This is an example use of opaflib. The script described here uses opaflib to gather statistics about the different PDF objects that appear in your file stash. These two charts show the frequencies of the filters and object types found in a small 10 MB database of randomly selected PDFs.

For a fuzzing corpus it is better that these frequencies come out fairly even; otherwise you'll be testing the same thing over and over.
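I haven't shown the opaflib calls here, but a rough approximation of what the statistics script computes can be sketched with the standard library alone: count how often each stream /Filter name appears across a folder of PDFs. This is just a byte-level grep, not the real script (which walks opaflib's parsed object tree), so it will miss filters hidden inside compressed object streams.

```python
# Sketch only: approximate filter statistics by grepping raw PDF bytes.
# The real script uses opaflib's parsed objects; this misses anything
# inside compressed object streams.
import collections
import glob
import re

# Standard filter names from ISO 32000-1, section 7.4.
FILTER_RE = re.compile(
    rb"/(FlateDecode|LZWDecode|ASCIIHexDecode|ASCII85Decode|"
    rb"RunLengthDecode|CCITTFaxDecode|DCTDecode|JBIG2Decode|JPXDecode)"
)

def filter_stats(pdf_dir):
    """Count /Filter occurrences over every *.pdf file in pdf_dir."""
    counts = collections.Counter()
    for path in glob.glob(pdf_dir + "/*.pdf"):
        with open(path, "rb") as f:
            counts.update(m.decode() for m in FILTER_RE.findall(f.read()))
    return counts
```

Sorting the resulting counter by frequency gives you exactly the kind of skew the charts show: FlateDecode usually dwarfs everything else.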

Keep reading for more exciting details!!! WEEEEEEEEEEEE!



August 23, 2010

It’s an Open PDF Analysis Framework!

A PDF file relies on a complex file structure built from a set of tokens and grammar rules, with each token potentially compressed, encrypted or even obfuscated. Open PDF Analysis Framework will understand, decompress and de-obfuscate these basic PDF elements and present the resulting soup as a clean XML tree (done!). From there, the idea is to compile a set of rules that can be used to decide what to keep, what to cut out, and ultimately whether it is safe to open the resulting PDF projection (todo!).

It's written in Python using the PLY parser generator. The project page is here and you can get the code from here:

svn checkout http://opaf.googlecode.com/svn/trunk/ opaf-read-only

Keep reading for a test run…

PDF sequential parsing

August 22, 2010

As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern; the format won't let you scan it properly. The size of a PDF stream is hard to determine at scanner/lexer time. I've suggested escaping the "endstream" keyword as a solution. Other patches have also emerged, like forcing the /Length entry to be direct, or calculating every object's size from the XREF pointers (assuming there is no garbage between the objects, which in fact is what the spec says).
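To make the problem concrete, here is a toy sketch (my own helper names, not opaflib code) of the two ways a lexer can delimit stream data: trusting /Length, which only works at lex time when the value is a direct number, or scanning ahead for "endstream", which breaks if the stream data happens to contain those bytes.

```python
# Toy illustration of the stream-length problem; helper names are mine.
def _skip_eol(buf, i):
    # Per the spec, the "stream" keyword is followed by CRLF or LF
    # before the actual data begins.
    if buf[i:i + 2] == b"\r\n":
        return i + 2
    if buf[i:i + 1] == b"\n":
        return i + 1
    return i

def stream_by_length(buf, stream_kw_end, length):
    # Only possible when /Length is a direct number we already lexed.
    start = _skip_eol(buf, stream_kw_end)
    return buf[start:start + length]

def stream_by_keyword(buf, stream_kw_end):
    # Fragile: misfires if the stream data contains b"endstream".
    start = _skip_eol(buf, stream_kw_end)
    end = buf.index(b"endstream", start)
    return buf[start:end].rstrip(b"\r\n")

obj = b"<</Length 5>>\nstream\nhello\nendstream"
kw_end = obj.index(b"stream") + len(b"stream")
assert stream_by_length(obj, kw_end, 5) == b"hello"
assert stream_by_keyword(obj, kw_end) == b"hello"
```

Both agree on well-formed input; they diverge exactly on the hostile files a fuzzer cares about.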

Well, in any case, if you manage to run a lexer and tokenize it, here you have the parsing grammar… weeee!!

object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array 

dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN 

dictionary_entry_list : NAME object dictionary_entry_list
                      | empty  


object_list : object object_list 
            | empty

indirect : indirect_object_stream
         | indirect_object 

indirect_object : OBJ object ENDOBJ 
indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ 

xref : indirect_object_stream 
     | XREF TRAILER dictionary 

pdf : HEADER pdf_update_list
pdf_update_list : pdf_update_list body xref pdf_end
                | body xref pdf_end

body : body indirect_object 
     | body indirect_object_stream 
     | empty
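The productions above read almost directly as a parser. This sketch is not the PLY version (PLY builds LALR tables from functions with these rules as docstrings); it is a plain recursive-descent walk over a pre-tokenized list covering just the `object` / `dictionary` / array rules, to show how the grammar composes.

```python
# Hand-rolled recursive descent over (TOKEN_TYPE, value) pairs,
# mirroring the object/dictionary/array productions above.
ATOMS = {"NAME", "STRING", "HEXSTRING", "NUMBER",
         "TRUE", "FALSE", "NULL", "R"}

def parse_object(tokens, i=0):
    """Return (parsed_value, next_index) for one `object` production."""
    kind, value = tokens[i]
    if kind in ATOMS:
        return value, i + 1
    if kind == "DOUBLE_LESS_THAN_SIGN":            # dictionary
        i += 1
        entries = {}
        while tokens[i][0] != "DOUBLE_GREATER_THAN_SIGN":
            key = tokens[i][1]                     # NAME object ...
            val, i = parse_object(tokens, i + 1)
            entries[key] = val
        return entries, i + 1
    if kind == "LEFT_SQUARE_BRACKET":              # array
        i += 1
        items = []
        while tokens[i][0] != "RIGHT_SQUARE_BRACKET":
            val, i = parse_object(tokens, i)
            items.append(val)
        return items, i + 1
    raise SyntaxError("unexpected token %r" % kind)

# << /ProcSet [ /PDF /Text ] >>
toks = [("DOUBLE_LESS_THAN_SIGN", "<<"),
        ("NAME", "ProcSet"),
        ("LEFT_SQUARE_BRACKET", "["),
        ("NAME", "PDF"), ("NAME", "Text"),
        ("RIGHT_SQUARE_BRACKET", "]"),
        ("DOUBLE_GREATER_THAN_SIGN", ">>")]
value, _ = parse_object(toks)
assert value == {"ProcSet": ["PDF", "Text"]}
```

Note how `dictionary_entry_list : NAME object dictionary_entry_list | empty` becomes the `while` loop: right recursion in the grammar is just iteration here.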




In an attempt to irrevocably declare my insanity, I went into the details of making a PDF lexer as strict to the specification as I could. This post is about writing a Portable Document Format lexer in Python using the PLY parser generator. The lexer is based on the ISO 32000-1 standard. Yes! PDF is an ISO standard, see.

In a PDF we have hexstrings and strings, numbers, names, arrays, references, null, booleans, dictionaries, streams, and the file structure entities (the header, the trailer dictionary, the EOF mark, the startxref mark and the cross-reference table). We are going to describe in detail all the tokens needed to define these entities. You'll probably want to take a look at how a parser is written in PLY in this simple example.
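To give a feel for what a few of those token classes match, here is a stripped-down regex tokenizer. This is an approximation, not the PLY rules from the post: name #xx escapes, literal strings, and the structural keywords are all omitted for brevity.

```python
# Approximate regexes for a handful of the token classes above
# (not the full PLY lexer; #xx name escapes etc. are omitted).
import re

TOKEN_SPEC = [
    # A name is a solidus followed by regular characters
    # (ISO 32000-1, section 7.3.5).
    ("NAME",      rb"/[^\x00\t\n\f\r ()<>\[\]{}/%]+"),
    ("HEXSTRING", rb"<[0-9A-Fa-f\s]*>"),
    ("NUMBER",    rb"[+-]?\d*\.?\d+"),
    ("TRUE",      rb"true"),
    ("FALSE",     rb"false"),
    ("NULL",      rb"null"),
]
MASTER = re.compile(b"|".join(b"(?P<%s>%s)" % (n.encode(), p)
                              for n, p in TOKEN_SPEC))

def tokenize(data):
    """Yield (token_type, matched_bytes) pairs, skipping whitespace."""
    for m in MASTER.finditer(data):
        yield m.lastgroup, m.group()
```

Running `tokenize(b"/Type /Pages 3 0 true")` yields NAME, NAME, NUMBER, NUMBER, TRUE, which is essentially what PLY's generated lexer produces for the same input.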


Before we go into the really, really, really boring stuff, let's do a quick demonstration of its value…
Let's pick a random PDF out there… hmm… for example jailbreakme.pdf. Then grab the already-done lexer here and run it like this…

python lexer.py "iPhone3,1_4.0.pdf"

it should output something like this…

iPhone3,1_4.0.pdf LexToken(HEADER,'1.3',1,0) LexToken(OBJ,('4', '0'),1,22) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,45) LexToken(STREAM_DATA,'q Q q 18 750 576 24 re W n /C ... ( ) Tj ET Q Q',1,48) LexToken(ENDOBJ,'endobj',1,696) LexToken(OBJ,('2', '0'),1,703) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,797) LexToken(ENDOBJ,'endobj',1,800) LexToken(OBJ,('6', '0'),1,807) LexToken(DOUBLE_LESS_THAN_SIGN,'<<',1,815) LexToken(NAME,'ProcSet',1,818) LexToken(LEFT_SQUARE_BRACKET,'[',1,827) LexToken(NAME,'PDF',1,829) LexToken(NAME,'Text',1,834) LexToken(RIGHT_SQUARE_BRACKET,']',1,840) LexToken(NAME,'ColorSpace',1,842) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,868) LexToken(NAME,'Font',1,871) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,892) LexToken(DOUBLE_GREATER_THAN_SIGN,'>>',1,895) LexToken(ENDOBJ,'endobj',1,898) LexToken(OBJ,('3', '0'),1,905) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,978) LexToken(ENDOBJ,'endobj',1,981) LexToken(OBJ,('12', '0'),1,988) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,1028) LexToken(ENDOBJ,'endobj',1,1031) LexToken(OBJ,('13', '0'),1,1038) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,1102) LexToken(STREAM_DATA,'x\x9c\xed}\rXT\xd7\xd5\xee\x1e...",1,1105) LexToken(ENDOBJ,'endobj',1,11834) LexToken(OBJ,('15', '0'),1,11841) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12058) LexToken(ENDOBJ,'endobj',1,12061) LexToken(OBJ,('16', '0'),1,12068) LexToken(LEFT_SQUARE_BRACKET,'[',1,12077) LexToken(NUMBER,'556',1,12079) LexToken(RIGHT_SQUARE_BRACKET,']',1,12083) LexToken(ENDOBJ,'endobj',1,12085) LexToken(OBJ,('9', '0'),1,12092) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12254) LexToken(ENDOBJ,'endobj',1,12257) LexToken(OBJ,('18', '0'),1,12264) LexToken(NUMBER,'9332',1,12273) LexToken(ENDOBJ,'endobj',1,12278) LexToken(OBJ,('20', '0'),1,12285) LexToken(LEFT_SQUARE_BRACKET,'[',1,12294) LexToken(NUMBER,'316',1,12296) LexToken(NUMBER,'0',1,12300) . 
LexToken(NUMBER,'613',1,12516) LexToken(RIGHT_SQUARE_BRACKET,']',1,12520) LexToken(ENDOBJ,'endobj',1,12522) LexToken(OBJ,('1', '0'),1,12529) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12540) LexToken(ENDOBJ,'endobj',1,12543) LexToken(XREF,[((0, 29), [(0, 65535, 'f'),...(17744, 0, 'n')])],1,12550) LexToken(TRAILER,'trailer',1,13140) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,13263) LexToken(STARTXREF,17942,1,13266) LexToken(EOF,'%%EOF\n',1,13282)

It marks the position of every object!!! WOW!!!!!!
