Lexing PDF, just for the not-fun of it.

August 6, 2010


Follow feliam on Twitter

In an attempt to irrevocably declare my insanity, I went into the details of making a PDF lexer as strict to the specification as I can. This post is about making a Portable Document Format lexer in Python using the PLY parser generator. This lexer is based on the ISO 32000-1 standard. Yes! PDF is an ISO standard.

In a PDF we have hexstrings and strings, numbers, names, arrays, references and null, booleans, dictionaries, streams and the file structure entities (the header, the trailer dictionary, the EOF mark, the startxref mark and the cross-reference table). We are going to describe in detail all the tokens needed to define the named entities. You’ll probably want to take a look at how a parser is written in PLY at this simple example.

QUICK DEMO

Before we go into the really really really boring stuff, let’s do a quick demonstration of its value…
Let’s pick a random PDF out there… hmm… for example jailbrakeme.pdf. Then grab the already done lexer here and run it like this…

python lexer.py "iPhone3,1_4.0.pdf"

it should output something like this…

iPhone3,1_4.0.pdf
LexToken(HEADER,'1.3',1,0)
LexToken(OBJ,('4', '0'),1,22)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,45)
LexToken(STREAM_DATA,'q Q q 18 750 576 24 re W n /C ... ( ) Tj ET Q Q',1,48)
LexToken(ENDOBJ,'endobj',1,696)
LexToken(OBJ,('2', '0'),1,703)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,797)
LexToken(ENDOBJ,'endobj',1,800)
LexToken(OBJ,('6', '0'),1,807)
LexToken(DOUBLE_LESS_THAN_SIGN,'<<',1,815)
LexToken(NAME,'ProcSet',1,818)
LexToken(LEFT_SQUARE_BRACKET,'[',1,827)
LexToken(NAME,'PDF',1,829)
LexToken(NAME,'Text',1,834)
LexToken(RIGHT_SQUARE_BRACKET,']',1,840)
LexToken(NAME,'ColorSpace',1,842)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,868)
LexToken(NAME,'Font',1,871)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,892)
LexToken(DOUBLE_GREATER_THAN_SIGN,'>>',1,895)
LexToken(ENDOBJ,'endobj',1,898)
LexToken(OBJ,('3', '0'),1,905)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,978)
LexToken(ENDOBJ,'endobj',1,981)
LexToken(OBJ,('12', '0'),1,988)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,1028)
LexToken(ENDOBJ,'endobj',1,1031)
LexToken(OBJ,('13', '0'),1,1038)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,1102)
LexToken(STREAM_DATA,'x\x9c\xed}\rXT\xd7\xd5\xee\x1e...",1,1105)
LexToken(ENDOBJ,'endobj',1,11834)
LexToken(OBJ,('15', '0'),1,11841)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12058)
LexToken(ENDOBJ,'endobj',1,12061)
LexToken(OBJ,('16', '0'),1,12068)
LexToken(LEFT_SQUARE_BRACKET,'[',1,12077)
LexToken(NUMBER,'556',1,12079)
LexToken(RIGHT_SQUARE_BRACKET,']',1,12083)
LexToken(ENDOBJ,'endobj',1,12085)
LexToken(OBJ,('9', '0'),1,12092)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12254)
LexToken(ENDOBJ,'endobj',1,12257)
LexToken(OBJ,('18', '0'),1,12264)
LexToken(NUMBER,'9332',1,12273)
LexToken(ENDOBJ,'endobj',1,12278)
LexToken(OBJ,('20', '0'),1,12285)
LexToken(LEFT_SQUARE_BRACKET,'[',1,12294)
LexToken(NUMBER,'316',1,12296)
LexToken(NUMBER,'0',1,12300)
...
LexToken(NUMBER,'613',1,12516)
LexToken(RIGHT_SQUARE_BRACKET,']',1,12520)
LexToken(ENDOBJ,'endobj',1,12522)
LexToken(OBJ,('1', '0'),1,12529)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,12540)
LexToken(ENDOBJ,'endobj',1,12543)
LexToken(XREF,[((0, 29), [(0, 65535, 'f'),...(17744, 0, 'n')])],1,12550)
LexToken(TRAILER,'trailer',1,13140)
LexToken(DOUBLE_LESS_THAN_SIGN,'<>',1,13263)
LexToken(STARTXREF,17942,1,13266)
LexToken(EOF,'%%EOF\n',1,13282)

It marks the position of every object!!! WOW!!!!!!

Character sets

The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings.

The white spaces …

white_spaces_r = r"\x20\r\n\t\x0c\x00"  # raw form, for use inside regex character classes
white_spaces = "\x20\r\n\t\x0c\x00"     # the actual characters

And the delimiter characters (, ), <, >, [, ], {, }, /, and % …

delimiters = r"()<>[]/%"      # This is odd: {} ?
delimiters_r = r"()<>\[\]/%"  # This is odd: {} ?
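To get a feel for how the three classes drive tokenization, here is a minimal sketch in plain Python, independent of PLY. The names `char_class` and `split_regular_runs` are made up for this illustration and are not part of the lexer; unlike the lists above, the sketch also counts `{` and `}` as delimiters, as the spec text does.

```python
# The three PDF character classes, as plain sets (illustrative names).
WHITE_SPACE = frozenset("\x00\t\n\x0c\r\x20")
DELIMITERS = frozenset("()<>[]{}/%")


def char_class(c):
    """Classify a single character as 'white-space', 'delimiter' or 'regular'."""
    if c in WHITE_SPACE:
        return "white-space"
    if c in DELIMITERS:
        return "delimiter"
    return "regular"


def split_regular_runs(text):
    """Group consecutive regular characters; delimiters and white space end a run."""
    runs, current = [], []
    for c in text:
        if char_class(c) == "regular":
            current.append(c)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


print(split_regular_runs("/Type/Page [549 3.14]"))
```

Note how `/Type/Page` splits into two runs without any white space in between: the SOLIDUS is a delimiter on its own.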

And here comes the first hack: the CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) characters, also called newline characters, shall be treated as end-of-line (EOL) markers. The combination of a CARRIAGE RETURN followed immediately by a LINE FEED shall be treated as one EOL marker.

eol = r'(\r\n|\r|\n)'  # \r\n goes first so that CR LF is matched as ONE marker
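A detail worth checking for any EOL pattern: Python's `re` tries the alternatives from left to right and keeps the first one that matches, so the order of the alternatives decides whether CR LF counts as one marker or two. A quick sketch with plain `re` (the variable names are just for this demo):

```python
import re

# Same three alternatives, two different orders.
EOL_CRLF_FIRST = re.compile(r'(\r\n|\r|\n)')
EOL_CR_FIRST = re.compile(r'(\r|\n|\r\n)')

sample = "one\r\ntwo\rthree\nfour"

crlf_first = EOL_CRLF_FIRST.findall(sample)
cr_first = EOL_CR_FIRST.findall(sample)

print(crlf_first)  # CR LF is reported as a single marker
print(cr_first)    # CR LF is split into a CR marker and an LF marker
```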

Boolean Objects

Boolean objects represent the logical values of true and false. They appear in PDF files using the keywords true and false.

t_TRUE = "true"
t_FALSE = "false"

Literal Strings

A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses and the backslash, which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.

EXAMPLE 1 The following are valid literal strings:
( This is a string )
( Strings may contain newlines
and such . )
( Strings may contain balanced parentheses ( ) and
special characters ( * ! & } ^ % and so on ) . )
( The following is an empty string . )
()
( It has zero ( 0 ) length . )

Parsing this is INSANE! A string lexer should keep going until every parenthesis is balanced, so we need to keep track of the number of parentheses we have consumed. For that we use different lexer states. But first let’s see how we start scanning one of these things… that is, with a LEFT_PARENTHESIS:

def t_LEFT_PARENTHESIS(t):
    r"\("
    # Opening parenthesis seen outside a string: enter the 'string' state
    # and start with an empty accumulator.
    t.lexer.push_state('string')
    t.lexer.string = ""

Any normal char we just consume and add it to the string accumulator…

def t_string_LITERAL_STRING_CHAR(t):
    r'.'
    t.lexer.string += t.value

Any ESCAPED character inside a string, like an octal encoded char or \r, \n, \t, \b, \f, or \\ is lexed like this…

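from ply.lex import TOKEN  # decorator for rules whose regex is built from parts

@TOKEN(r'\\([nrtbf()\\]|[0-7]{1,3}|' + eol + ')')
def t_string_ESCAPED_SEQUENCE(t):
    val = t.value[1:]  # drop the leading backslash
    if val[0] in '0123':
        # Up to three octal digits that fit in one byte.
        value = chr(int(val, 8))
    elif val[0] in '4567':
        # A third digit would overflow a byte: take two digits as the
        # character code and keep the rest as literal text.
        value = chr(int(val[:2], 8)) + val[2:]
    else:
        # A backslash followed by an EOL is a line continuation (nothing added).
        value = {"\n": "", "\r": "",
                 "n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
                 "(": "(", ")": ")", "\\": "\\"}[val[0]]
    t.lexer.string += value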

ALSO the newlines inside strings are treated differently. An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.

@TOKEN(eol)
def t_string_LITERAL_STRING_EOL(t):
    t.lexer.string += "\x0A"

And lastly the lexer state stacking thing that deals with the parenthesis balancing insanity.

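def t_string_LEFT_PARENTHESIS(t):
    r"\("
    # Nested parenthesis: go one level deeper and keep the character.
    t.lexer.push_state('string')
    t.lexer.string += "("

def t_string_RIGHT_PARENTHESIS(t):
    r"\)"
    t.lexer.pop_state()
    if t.lexer.current_state() == 'string':
        # Still inside the string: this one closed a nested pair.
        t.lexer.string += ")"
    else:
        # Back in the outer state: the string is complete.
        t.type = "STRING"
        t.value = t.lexer.string
        return t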
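If the state stacking feels too magic, the same rules can be written as one plain loop with a depth counter. This is only a sketch to make the logic visible, not the PLY lexer: `read_literal_string` and `SIMPLE_ESCAPES` are names invented here, and for octal escapes it takes up to three digits and ignores the high-order overflow, which is a different choice from the rule above.

```python
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
                  "(": "(", ")": ")", "\\": "\\"}


def read_literal_string(data, pos=0):
    """Decode the literal string starting at data[pos] == '('.

    Returns (decoded_text, index_just_after_the_closing_parenthesis).
    """
    if data[pos] != "(":
        raise ValueError("expected '('")
    out, depth, i = [], 1, pos + 1
    while i < len(data):
        c = data[i]
        if c == "\\":
            nxt = data[i + 1:i + 2]
            if nxt and nxt in "01234567":
                # Octal escape: one to three digits, overflow bits dropped.
                j = i + 1
                while j < len(data) and j < i + 4 and data[j] in "01234567":
                    j += 1
                out.append(chr(int(data[i + 1:j], 8) & 0xFF))
                i = j
            elif nxt in ("\r", "\n"):
                # Backslash + EOL is a line continuation: nothing is added.
                i += 3 if data[i + 1:i + 3] == "\r\n" else 2
            else:
                # Known escapes are translated; unknown ones keep the character.
                out.append(SIMPLE_ESCAPES.get(nxt, nxt))
                i += 2
        elif c in "\r\n":
            # A bare EOL inside a string always becomes a single LINE FEED.
            out.append("\n")
            i += 2 if data[i:i + 2] == "\r\n" else 1
        elif c == "(":
            depth += 1
            out.append(c)
            i += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                return "".join(out), i + 1
            out.append(c)
            i += 1
        else:
            out.append(c)
            i += 1
    raise ValueError("unbalanced parentheses")


print(read_literal_string("(a (b) c)"))
```

The depth counter plays the role of the state stack: every nested `(` pushes, every `)` pops, and the string ends when the counter reaches zero.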

Hexadecimal Strings

Strings may also be written in hexadecimal form, which is useful for including arbitrary binary data in a PDF file. A hexadecimal string shall be written as a sequence of hexadecimal digits (0-9 and either A-F or a-f) encoded as ASCII characters and enclosed within angle brackets (< and >).

EXAMPLE 1 < 4E6F762073686D6F7A206B6120706F702E >

Each pair of hexadecimal digits defines one byte of the string. White-space characters shall be ignored. If the final digit of a hexadecimal string is missing -that is, if there is an odd number of digits- the final digit shall be assumed to be 0.

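# The pattern was lost in the page markup; this one is reconstructed from the
# description above (hex digits and white space between angle brackets).
@TOKEN(r'<[0-9A-Fa-f' + white_spaces_r + r']*>')
def t_HEXSTRING(t):
    # Drop the angle brackets and any white space between the digits.
    t.value = ''.join([c for c in t.value if c not in white_spaces + "<>"])
    # Pad an odd number of digits with a trailing 0, then decode
    # (Python 2 idiom; binascii.unhexlify is the portable equivalent).
    t.value = (t.value + ('0' * (len(t.value) % 2))).decode('hex')
    return t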
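The decoding rule itself (ignore white space, pad an odd digit count with 0) is easy to try in isolation. A small Python 3 sketch, assuming the token text still carries its angle brackets; `decode_hexstring` is a name made up for this example:

```python
import re


def decode_hexstring(token):
    """Turn '<48 65 6C6C 6F>' into the bytes it encodes."""
    body = token.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("hex strings are enclosed in angle brackets")
    # White-space characters between the digits are ignored.
    digits = re.sub(r"[\x20\r\n\t\x0c\x00]", "", body[1:-1])
    # A missing final digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    # bytes.fromhex raises ValueError on anything that is not a hex digit.
    return bytes.fromhex(digits)


print(decode_hexstring("<901FA3>"))
print(decode_hexstring("<901FA>"))
```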

Name objects

Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). PDF names are basically everything starting with a “/” and ending with some delimiter. In any case we need a different lexer state to handle this.

It starts with a SOLIDUS:

def t_NAME(t):
    r'/'
    t.lexer.push_state('name')
    t.lexer.name = ""
    t.lexer.start = t.lexpos  # remember where the name began

Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.

def t_name_HEXCHAR(t):
    r'\#[0-9a-fA-F]{2}'
    assert t.value != "#00"  # null is not allowed in a name
    t.lexer.name += t.value[1:].decode('hex')

Any “normal character” (not a delimiter, nor a whitespace) is consumed directly…

@TOKEN(r'[^' + white_spaces_r + delimiters_r + ']')
def t_name_NAMECHAR(t):
    t.lexer.name += t.value

Otherwise the name ends and we return the token…

@TOKEN(r'[' + white_spaces_r + delimiters_r + ']')
def t_name_WHITESPACE(t):
    t.lexer.pop_state()
    t.lexer.lexpos -= 1          # give the delimiter back to the outer state
    t.lexpos = t.lexer.start     # report the position of the leading SOLIDUS
    t.type = "NAME"
    t.value = t.lexer.name
    t.lexer.name = ""
    return t
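For a name that has already been cut out of the file, the #xx decoding fits in a single `re.sub`. A minimal sketch, assuming the raw text starts with the SOLIDUS and stops before the closing delimiter; `decode_name` is a hypothetical helper, not part of the lexer:

```python
import re


def decode_name(raw):
    """Decode the #xx escapes of a raw PDF name such as '/Lime#20Green'."""
    if not raw.startswith("/"):
        raise ValueError("a name starts with a SOLIDUS")

    def unescape(match):
        code = int(match.group(1), 16)
        if code == 0:
            # Null (character code 0) is not allowed in a name.
            raise ValueError("#00 is not allowed in a name")
        return chr(code)

    return re.sub(r"#([0-9A-Fa-f]{2})", unescape, raw[1:])


print(decode_name("/Lime#20Green"))
print(decode_name("/paired#28#29parentheses"))
```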

Array Objects

An array shall be written as a sequence of objects enclosed in [ and ].

[ 549 3.14 false ( Ralph ) /SomeName ]

At last something fairly simple! We just need to scan for this…

t_LEFT_SQUARE_BRACKET = r"\["
t_RIGHT_SQUARE_BRACKET = r"\]"

Dictionary Objects

A dictionary shall be written as a sequence of key-value pairs enclosed in double angle brackets (<< … >>).

Again, a simple thing…

t_DOUBLE_LESS_THAN_SIGN = r'<<'
t_DOUBLE_GREATER_THAN_SIGN = r'>>'

Stream Objects

A stream object, like a string object, is a sequence of bytes. A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream (followed by newline) and endstream.
Note that the keyword “endstream” may appear in the middle of a stream, making it impossible to scan for reliably. For that reason the stream dictionary MUST have a /Length key in it to disambiguate the length (and in some cases accelerate the scan) of the following stream.

For now we do not take the /Length key into consideration and simply scan until we find the next “endstream”.

def t_STREAM_DATA(t):
    r'stream(\r\n|\n)'
    data = t.lexer.lexdata
    found = data.find('endstream', t.lexer.lexpos)
    if found == -1:
        raise Exception("Error:Parsing:Lexer: Could not find endstream string.")
    # The EOL marker right before endstream is not part of the stream data.
    end = found
    if data[found - 2:found] == '\r\n':
        end = found - 2
    elif data[found - 1] in '\r\n':
        end = found - 1
    else:
        # TODO log errors: no EOL marker before endstream
        pass
    t.value = data[t.lexer.lexpos:end]
    t.lexer.lexpos = found + 9  # skip past 'endstream'
    t.type = "STREAM_DATA"
    return t
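For comparison, here is roughly what the other approach could look like: trust /Length, slice exactly that many bytes and then insist that endstream follows. This is a sketch on raw bytes with plain `re`, not what the lexer does; `read_stream` and its parameters are invented for the example, and resolving /Length (which may itself be an indirect reference) is left out.

```python
import re

_STREAM_START = re.compile(rb"stream(\r\n|\n)")
_STREAM_END = re.compile(rb"(\r\n|\r|\n)?endstream")


def read_stream(data, pos, length):
    """Read `length` bytes of stream data starting at the stream keyword.

    Returns (payload, index_just_after_endstream).
    """
    start_kw = _STREAM_START.match(data, pos)
    if start_kw is None:
        raise ValueError("stream keyword not found at this position")
    start = start_kw.end()
    end = start + length
    # Only an optional EOL marker may sit between the data and endstream.
    tail = _STREAM_END.match(data, end)
    if tail is None:
        raise ValueError("/Length does not lead to an endstream keyword")
    return data[start:end], tail.end()


# The payload contains the word 'endstream', which would fool a plain search.
data = b"stream\nabc endstream xyz\nendstream"
payload, after = read_stream(data, 0, 17)
print(payload, after)
```

The trade-off is the obvious one: this handles payloads containing the keyword, but rejects files whose /Length is wrong.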

Indirect Objects

Any object in a PDF file may be labeled as an indirect object. The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj.
The “N M obj” keyword…

def t_OBJ(t):
    r'\d+\x20\d+\x20obj'  #[0-9]{1,10} [0-9]+ obj'
    t.value = tuple(t.value.split("\x20")[:2])
    return t

and the endobj…

t_ENDOBJ = r'endobj'

The object may be referred to from elsewhere in the file by an indirect reference. Such indirect references shall consist of the object number, the generation number, and the keyword R (with white space separating each part):

12 0 R

def t_R(t):
    r'\d+\x20\d+\x20R'
    t.value = tuple([int(x, 10) for x in t.value.split("\x20")[:2]])
    return t

The null object has a type and value that are unequal to those of any other object. There shall be only one object of type null, denoted by the keyword null.

t_NULL = r'null'

Numeric Objects

34.5 -3.62 +123.6 4. -.002 0.0 123 43445 +17 -98 0

PDF provides two types of numeric objects: integer and real. Integer objects represent mathematical integers. Real objects represent mathematical real numbers.

def t_NUMBER(t):
    r'[+-]{0,1}(\d*\.\d+|\d+\.\d*|\d+)'
    return t
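The rule above hands the number over as text. If you want the integer/real distinction right away, something like this sketch would do it (plain `re`, same pattern; `parse_number` is a name invented here, and treating "has a dot" as "is a real" is my assumption for the example):

```python
import re

# Same pattern as the NUMBER rule; re.ASCII keeps \d to the digits 0-9.
NUMBER = re.compile(r'[+-]?(\d*\.\d+|\d+\.\d*|\d+)', re.ASCII)


def parse_number(text):
    """Return an int for integer objects and a float for real objects."""
    if NUMBER.fullmatch(text) is None:
        raise ValueError("not a PDF number: %r" % text)
    return float(text) if "." in text else int(text)


for sample in ["34.5", "-3.62", "4.", "-.002", "+17", "-98", "0"]:
    print(sample, "->", repr(parse_number(sample)))
```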

File Header

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.

def t_HEADER(t):
    r'%PDF-1\.[0-7]'
    t.value = t.value[-3:]  # keep only the version, e.g. '1.3'
    return t

Cross-Reference Table

Nowadays it seems that every “good” PDF out there is using cross-reference streams (possibly compressed), but the following is still the simplest cross-referencing method described in the spec. Each cross-reference section shall begin with a line containing the keyword xref.

@TOKEN(r'xref[' + white_spaces_r + ']*' + eol)
def t_XREF(t):
    t.lexer.push_state('xref')
    t.lexer.xref = []
    t.lexer.xref_start = t.lexpos

Following this line shall be one or more cross-reference subsections, which may appear in any order.

@TOKEN(r'[0-9]+[ ][0-9]+[' + white_spaces_r + ']*' + eol)
def t_xref_SUBXREF(t):
    # Subsection header: first object number and number of entries.
    n = t.value.split(" ")
    t.lexer.xref.append(((int(n[0], 10), int(n[1], 10)), []))

def t_xref_XREFENTRY(t):
    r'\d{10}[ ]\d{5}[ ][nf](\x20\x0D|\x20\x0A|\x0D\x0A)'
    # Entry: offset, generation number and the in-use/free flag.
    n = t.value.strip().split(" ")
    t.lexer.xref[-1][1].append((int(n[0], 10), int(n[1], 10), n[2]))

Anything that does not match the previous rules is a get-out-of-here indicator…

def t_xref_out(t):
    r'.'
    t.lexer.pop_state()
    t.type = 'XREF'
    t.value = t.lexer.xref
    t.lexer.lexpos -= 1              # give the character back to the outer state
    t.lexpos = t.lexer.xref_start    # report the position of the xref keyword
    return t
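Outside PLY the whole table can be read with a couple of regular expressions and a loop. The sketch below works on raw bytes and builds the same shape of value the XREF token carries, a list of ((first, count), entries) pairs; `parse_xref` and the pattern names are made up for this example, and unlike the rules above it insists that each subsection really holds `count` entries.

```python
import re

XREF_KEYWORD = re.compile(rb"xref[ \t]*(?:\r\n|\r|\n)")
SUBSECTION = re.compile(rb"(\d+) (\d+)[ \t]*(?:\r\n|\r|\n)")
# Every entry is exactly 20 bytes long, including its 2-byte EOL.
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)")


def parse_xref(data, pos=0):
    """Parse a classic cross-reference table that starts at data[pos].

    Returns (sections, index_of_first_byte_after_the_table).
    """
    head = XREF_KEYWORD.match(data, pos)
    if head is None:
        raise ValueError("xref keyword not found at this position")
    pos = head.end()
    sections = []
    while True:
        sub = SUBSECTION.match(data, pos)
        if sub is None:
            break  # anything else ends the table
        first, count = int(sub.group(1)), int(sub.group(2))
        pos = sub.end()
        entries = []
        for _ in range(count):
            entry = ENTRY.match(data, pos)
            if entry is None:
                raise ValueError("truncated cross-reference subsection")
            entries.append((int(entry.group(1)), int(entry.group(2)),
                            entry.group(3).decode("ascii")))
            pos = entry.end()
        sections.append(((first, count), entries))
    return sections, pos


table = b"xref\n0 2\n0000000000 65535 f \n0000000017 00000 n \ntrailer\n"
sections, end = parse_xref(table)
print(sections)
```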

File Trailer

The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>).
Thus, the trailer has the following overall structure:

trailer
<< key1 value1
key2 value2
…
keyn valuen
>>
startxref
Byte_offset_of_last_cross-reference_section
%%EOF

So we just need to add these last 3 tokens…

t_TRAILER = r'trailer'

@TOKEN(r'startxref' + '[' + white_spaces_r + ']+[0-9]+')
def t_STARTXREF(t):
    t.value = int(t.value[10:], 10)  # skip 'startxref' and one white-space char
    return t

t_EOF = r'%%EOF'
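Since readers are supposed to start from the end of the file, here is a small sketch of that first step: look at the tail of the file and pull out the startxref offset. It is plain `re` on raw bytes; `find_startxref` and the 1024-byte window are arbitrary choices for the illustration, not something the spec or the lexer prescribes.

```python
import re

_WS = rb"[\x20\r\n\t\x0c\x00]"
TAIL = re.compile(rb"startxref" + _WS + rb"+(\d+)" + _WS + rb"*%%EOF")


def find_startxref(data, window=1024):
    """Return the byte offset announced by the last startxref in the file."""
    tail = data[-window:]
    matches = list(TAIL.finditer(tail))
    if not matches:
        raise ValueError("no startxref / %%EOF pair near the end of the file")
    # A file with incremental updates has several; the last one wins.
    return int(matches[-1].group(1))


sample = b"trailer\n<< /Size 3 >>\nstartxref\n17942\n%%EOF\n"
print(find_startxref(sample))
```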

Ok… that’s pretty much it. Bored? Well I am.
Where to go from here?
The bar?
Or the parser?

f/


6 Responses to “Lexing PDF, just for the not-fun of it.”

  1. jduck said

    Nice work! Looking forward to seeing your next card 🙂

  2. […] a test run… Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here.. […]

  3. Matias Zurbriggen said

    Cool piece of work!
    looking forward to seeing the next post!

  4. Great work! Something solid like this needs to be shared and worked on together. I keep seeing one off implementations that don’t account for everything like this.

    Just a few things I noticed:

    I remember watching Julia Wolf speak on the whole issue with identifying the end of an object and one of the things she discovered was that the length of the object didn’t have to actually match the length. It seems like this could cause a problem for your parser given it falls back on the length to identify the end of the object.

    Part of me also thought that named objects could use octal on top of standard ascii and hex. I didn’t see anything to translate that either in the parser.

    • feliam said

      The issue about the /Length not having to match the endstream tag seems to be an Adobe Reader (and probably others) implementation thing. The spec says otherwise:

      #7.3.8:Stream Objects :
      The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes. There should be an end-of-line marker after the data and before endstream; this marker shall not be included in the stream length. There shall not be any extra bytes, other than white space, between endstream and endobj.

      That pretty much leaves us with two options: trying to match every spec deviation implemented in any reader, or sticking to the specification and ruling out the docs that won’t parse as specified. It depends on what your parser is for. Probably in the latter case there will be a lot of benign documents that won’t parse.
      There are in fact more ways of identifying a stream length (/Length, stream/endstream and xref), which do not need to match for the PDF to be parsed by, say, Adobe.

      Not sure if I understand what you said about named objects. Couldn’t find anything pointing to octal encoded pdf names.. (7.3.5 Name Objects?), yet.
