PDF sequential parsing

August 22, 2010

As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern. The format won’t let you scan it cleanly, because the size of a PDF stream is hard to determine at scanner/lexer time. I’ve suggested escaping the “endstream” keyword as a solution. Other patches have also emerged, such as forcing the /Length value to be a direct object, or calculating every object’s size from the XREF pointers (assuming there is no garbage between the objects, which is in fact what the spec says).
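To make the XREF-based workaround concrete, here is a minimal sketch in Python. It assumes the byte offsets have already been read from the xref table into a dict, and that objects sit back to back with no garbage between them. The function name `object_extents` and the `body_end` parameter are hypothetical, chosen only for this illustration.

```python
def object_extents(xref_offsets, body_end):
    """Map each object number to (start, size), assuming contiguous objects.

    xref_offsets: dict of object number -> byte offset (hypothetical input,
                  as it might be collected from an xref table).
    body_end:     offset where the body ends, e.g. where the xref section
                  itself starts (also an assumption of this sketch).
    """
    # Sort by file offset, since object numbers need not follow file order.
    ordered = sorted(xref_offsets.items(), key=lambda item: item[1])
    extents = {}
    for index, (obj_num, start) in enumerate(ordered):
        # The next object's offset (or the end of the body) bounds this one.
        if index + 1 < len(ordered):
            end = ordered[index + 1][1]
        else:
            end = body_end
        extents[obj_num] = (start, end - start)
    return extents


# Made-up offsets, purely for illustration.
print(object_extents({1: 15, 2: 64, 3: 120}, body_end=300))
```

With sizes known up front, a scanner could hand each object to the lexer as a bounded slice instead of hunting for “endstream”. This only holds while the no-garbage assumption holds.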

Well, in any case, if you manage to run a lexer and tokenize the file, here is the parsing grammar … weeee!!

object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array 

dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN 

dictionary_entry_list : NAME object dictionary_entry_list
                      | empty  

array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET 

object_list : object object_list 
            | empty

indirect : indirect_object_stream
         | indirect_object 

indirect_object : OBJ object ENDOBJ 
indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ 

xref : indirect_object_stream 
     | XREF TRAILER dictionary 

pdf : HEADER pdf_update_list
pdf_update_list : pdf_update_list body xref pdf_end
                | body xref pdf_end

body : body indirect_object 
     | body indirect_object_stream 
     | empty

pdf_end : STARTXREF EOF
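
As an illustration of how the object, dictionary and array rules above could be consumed, here is a small hand-written recursive-descent sketch in Python. It is not the parser used for this post. It assumes the lexer yields `(kind, value)` tuples whose `kind` matches the terminal names in the grammar, and that an indirect reference arrives as a single `R` token. The helper names (`parse_object`, `parse_dictionary`, `parse_array`, `expect`) are hypothetical.

```python
ATOMS = {"NAME", "STRING", "HEXSTRING", "NUMBER", "TRUE", "FALSE", "NULL", "R"}


def expect(tokens, pos, kind):
    """Check that tokens[pos] has the given kind; return the next position."""
    if pos >= len(tokens) or tokens[pos][0] != kind:
        raise SyntaxError(f"expected {kind} at token {pos}")
    return pos + 1


def parse_object(tokens, pos=0):
    """object : atom | dictionary | array. Returns (value, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[pos]
    if kind in ATOMS:
        return value, pos + 1
    if kind == "DOUBLE_LESS_THAN_SIGN":
        return parse_dictionary(tokens, pos)
    if kind == "LEFT_SQUARE_BRACKET":
        return parse_array(tokens, pos)
    raise SyntaxError(f"unexpected token {kind} at token {pos}")


def parse_dictionary(tokens, pos):
    """dictionary : '<<' (NAME object)* '>>'."""
    pos = expect(tokens, pos, "DOUBLE_LESS_THAN_SIGN")
    entries = {}
    while pos < len(tokens) and tokens[pos][0] == "NAME":
        key = tokens[pos][1]
        value, pos = parse_object(tokens, pos + 1)
        entries[key] = value
    pos = expect(tokens, pos, "DOUBLE_GREATER_THAN_SIGN")
    return entries, pos


def parse_array(tokens, pos):
    """array : '[' object* ']'."""
    pos = expect(tokens, pos, "LEFT_SQUARE_BRACKET")
    items = []
    while pos < len(tokens) and tokens[pos][0] != "RIGHT_SQUARE_BRACKET":
        item, pos = parse_object(tokens, pos)
        items.append(item)
    pos = expect(tokens, pos, "RIGHT_SQUARE_BRACKET")
    return items, pos


# Tokens for: << /Type /Catalog /Kids [ 1 2 ] >>
tokens = [
    ("DOUBLE_LESS_THAN_SIGN", "<<"),
    ("NAME", "/Type"), ("NAME", "/Catalog"),
    ("NAME", "/Kids"),
    ("LEFT_SQUARE_BRACKET", "["),
    ("NUMBER", 1), ("NUMBER", 2),
    ("RIGHT_SQUARE_BRACKET", "]"),
    ("DOUBLE_GREATER_THAN_SIGN", ">>"),
]
value, next_pos = parse_object(tokens)
print(value, next_pos)
```

The right-recursive list rules (`dictionary_entry_list`, `object_list`) become plain loops here. A generated LALR parser would handle the recursion itself.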

Posted by feliam

Filed in pdf, security · Tags: parser, pdf, security


3 Responses to “PDF sequential parsing”

Leonard Rosenthol said
August 24, 2010 at 8:14 pm
For some small subset of PDFs, this approach will work. But as you noted, it requires a bunch of “exceptions” that would be avoided entirely if you used the xrefs as they are meant to be used.

feliam said
August 24, 2010 at 8:28 pm
I’m still waiting for the spec quote where it says that it accepts things outside that grammar,
a.k.a. “garbage between objects” or “overlapped objects”.

In practice... well, yes, you would need to handle out-of-spec bits anyway.

Opaf! « Feliam's Blog said
August 26, 2010 at 6:09 pm
[…] Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here.. Besides the straight […]


Feliam's Blog