PDF sequential parsing

August 22, 2010


As discussed in earlier posts the problem with PDF is that we can not apply an out-of-the-box scanner/parser design pattern. It won’t let you scan it properly. The size of a PDF stream is hard to be decided at scanner/lexer time. I’ve suggested the solution of escaping the “endstream” keyword. Also other patches emerged like, forcing the /Length keyword to be direct. Or calculate every object size using XREFs pointers (assuming not garbage between the objs (which in fact is what the spec says)).

Well in any case if you manage to run a lexer and tokenize it here you have the parsing grammar … weeee!!

object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array 

dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN 

dictionary_entry_list : NAME object dictionary_entry_list
                      | empty  

array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET 

object_list : object object_list 
            | empty

indirect : indirect_object_stream
         | indirect_object 

indirect_object : OBJ object ENDOBJ 
indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ 

xref : indirect_object_stream 
     | XREF TRAILER dictionary 

pdf : HEADER pdf_update_list
pdf_update_list : pdf_update_list body xref pdf_end
                | body xref pdf_end

body : body indirect_object 
     | body indirect_object_stream 
     | empty

pdf_end : STARTXREF EOF

f/

About these ads

3 Responses to “PDF sequential parsing”

  1. Leonard Rosenthol said

    For some small subset of PDFs, this approach will work. But as you noted – it requires a bunch of “exception” that would be avoided entirely if you used the xrefs as they are meant to be used.

  2. feliam said

    I’m still waiting for the spec quote where it says that it accepts things outside that grammar.
    a.k.a “garbage between objects” or “overlapped objects”

    In practice.. well yes you would need to handle out-of-spec bits anyway.

  3. [...] Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here.. Besides the straight [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: