Follow feliam on Twitter

In an attempt to irrevocably declare my insanity I went into the details of making a PDF lexer the most strict to the specification I can. This post is about making a Portable File Format lexer in python using the PLY parser generator. This lexer is based on the ISO 32000-1 standard. Yes! PDF is an ISO standard, see.

In a PDF we have hexstrings and strings, numbers, names, arrays, references and null, booleans, dictionaries, streams and the file structure entities (the header, the trailer dictionary, the eof mark, the startxref mark and the crossreference). We are going to describe in detail all the tokens needed to define the named entities. You’ll probably want to take a look on how a parser is written in PLY at this simple example.

QUICK DEMO

Before we go into the really really really boring stuff, let’s do a quick demonstration of it’s value…
Let’s pick a random PDF out there… hmm.. for example jailbrakeme.pdf. Then grab the already done lexer here and run it like this…

python lexer.py “iPhone3,1_4.0.pdf”

it should output something like this…

iPhone3,1_4.0.pdf LexToken(HEADER,'1.3',1,0) LexToken(OBJ,('4', '0'),1,22) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,45) LexToken(STREAM_DATA,'q Q q 18 750 576 24 re W n /C ... ( ) Tj ET Q Q',1,48) LexToken(ENDOBJ,'endobj',1,696) LexToken(OBJ,('2', '0'),1,703) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,797) LexToken(ENDOBJ,'endobj',1,800) LexToken(OBJ,('6', '0'),1,807) LexToken(DOUBLE_LESS_THAN_SIGN,'<<',1,815) LexToken(NAME,'ProcSet',1,818) LexToken(LEFT_SQUARE_BRACKET,'[',1,827) LexToken(NAME,'PDF',1,829) LexToken(NAME,'Text',1,834) LexToken(RIGHT_SQUARE_BRACKET,']',1,840) LexToken(NAME,'ColorSpace',1,842) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,868) LexToken(NAME,'Font',1,871) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,892) LexToken(DOUBLE_GREATER_THAN_SIGN,'>>',1,895) LexToken(ENDOBJ,'endobj',1,898) LexToken(OBJ,('3', '0'),1,905) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,978) LexToken(ENDOBJ,'endobj',1,981) LexToken(OBJ,('12', '0'),1,988) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,1028) LexToken(ENDOBJ,'endobj',1,1031) LexToken(OBJ,('13', '0'),1,1038) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,1102) LexToken(STREAM_DATA,'x\x9c\xed}\rXT\xd7\xd5\xee\x1e...",1,1105) LexToken(ENDOBJ,'endobj',1,11834) LexToken(OBJ,('15', '0'),1,11841) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12058) LexToken(ENDOBJ,'endobj',1,12061) LexToken(OBJ,('16', '0'),1,12068) LexToken(LEFT_SQUARE_BRACKET,'[',1,12077) LexToken(NUMBER,'556',1,12079) LexToken(RIGHT_SQUARE_BRACKET,']',1,12083) LexToken(ENDOBJ,'endobj',1,12085) LexToken(OBJ,('9', '0'),1,12092) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12254) LexToken(ENDOBJ,'endobj',1,12257) LexToken(OBJ,('18', '0'),1,12264) LexToken(NUMBER,'9332',1,12273) LexToken(ENDOBJ,'endobj',1,12278) LexToken(OBJ,('20', '0'),1,12285) LexToken(LEFT_SQUARE_BRACKET,'[',1,12294) LexToken(NUMBER,'316',1,12296) LexToken(NUMBER,'0',1,12300) . LexToken(NUMBER,'613',1,12516) LexToken(RIGHT_SQUARE_BRACKET,']',1,12520) LexToken(ENDOBJ,'endobj',1,12522) LexToken(OBJ,('1', '0'),1,12529) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12540) LexToken(ENDOBJ,'endobj',1,12543) LexToken(XREF,[((0, 29), [(0, 65535, 'f'),...(17744, 0, 'n')])],1,12550) LexToken(TRAILER,'trailer',1,13140) LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,13263) LexToken(STARTXREF,17942,1,13266) LexToken(EOF,'%%EOF\n',1,13282)

It marks the position of every object!!! WOW!!!!!!

Read the rest of this entry »

Follow

Get every new post delivered to your Inbox.