PDF sequential parsing
August 22, 2010
As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern. The format won’t let you scan it properly, because the size of a PDF stream is hard to determine at scanner/lexer time. I’ve suggested escaping the “endstream” keyword as a solution. Other patches have also emerged, like forcing the /Length value to be a direct object, or calculating every object’s size from the XREF pointers (assuming there is no garbage between the objects, which is in fact what the spec says).
Well, in any case, if you manage to run a lexer and tokenize the file, here you have the parsing grammar … weeee!!
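To make the XREF idea concrete, here is a minimal Python sketch of deriving object sizes from xref offsets. It assumes the objects are packed back to back with no garbage between them. The function name `object_sizes`, the `body_end` parameter (the offset where the object body stops, for example where the xref section begins) and the sample offsets are all made up for illustration.

```python
def object_sizes(offsets, body_end):
    """Map each object's start offset to its size in bytes.

    Assumes objects are contiguous: each one ends exactly where the
    next one (in file order) begins, and the last one ends at body_end.
    """
    ordered = sorted(set(offsets))      # xref entries are not in file order
    ends = ordered[1:] + [body_end]     # end of each object = start of the next
    return {start: end - start for start, end in zip(ordered, ends)}


# Hypothetical offsets, as they might be read from an xref table.
sizes = object_sizes([15, 220, 74], body_end=300)
print(sizes)
```

With sizes known up front, the lexer can slice out each object (and any stream inside it) without having to guess where “endstream” really is. It breaks down as soon as objects overlap or junk sits between them.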
object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array
dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN
dictionary_entry_list : NAME object dictionary_entry_list
| empty
array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET
object_list : object object_list
| empty
indirect : indirect_object_stream
| indirect_object
indirect_object : OBJ object ENDOBJ
indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ
xref : indirect_object_stream
| XREF TRAILER dictionary
pdf : HEADER pdf_update_list
pdf_update_list : pdf_update_list body xref pdf_end
| body xref pdf_end
body : body indirect_object
| body indirect_object_stream
| empty
pdf_end : STARTXREF EOF
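
The object part of the grammar (object, dictionary, array) is small enough to hand-roll. What follows is only a sketch of a recursive-descent parser for those three rules, not the parser used in the post. It assumes the lexer yields `(type, value)` tuples named after the grammar's terminals, and that an `R` token already carries the whole reference as a `(number, generation)` pair. Both are my assumptions, as are the function names.

```python
ATOMS = {"NAME", "STRING", "HEXSTRING", "NUMBER", "TRUE", "FALSE", "NULL", "R"}


class ParseError(Exception):
    pass


def expect(tokens, pos, kind):
    """Check that tokens[pos] has the given type; return the next position."""
    if pos >= len(tokens) or tokens[pos][0] != kind:
        raise ParseError("expected %s at token %d" % (kind, pos))
    return pos + 1


def parse_object(tokens, pos=0):
    """object : atom | dictionary | array. Returns (value, next_pos)."""
    if pos >= len(tokens):
        raise ParseError("unexpected end of input")
    kind, value = tokens[pos]
    if kind in ATOMS:
        return value, pos + 1
    if kind == "DOUBLE_LESS_THAN_SIGN":
        return parse_dictionary(tokens, pos)
    if kind == "LEFT_SQUARE_BRACKET":
        return parse_array(tokens, pos)
    raise ParseError("unexpected %s at token %d" % (kind, pos))


def parse_dictionary(tokens, pos):
    """dictionary : << dictionary_entry_list >>, entries being NAME object."""
    pos = expect(tokens, pos, "DOUBLE_LESS_THAN_SIGN")
    entries = {}
    while pos < len(tokens) and tokens[pos][0] == "NAME":
        key = tokens[pos][1]
        value, pos = parse_object(tokens, pos + 1)
        entries[key] = value
    pos = expect(tokens, pos, "DOUBLE_GREATER_THAN_SIGN")
    return entries, pos


def parse_array(tokens, pos):
    """array : [ object_list ]"""
    pos = expect(tokens, pos, "LEFT_SQUARE_BRACKET")
    items = []
    while pos < len(tokens) and tokens[pos][0] != "RIGHT_SQUARE_BRACKET":
        value, pos = parse_object(tokens, pos)
        items.append(value)
    pos = expect(tokens, pos, "RIGHT_SQUARE_BRACKET")
    return items, pos


# Hypothetical token stream for: << /Type /Pages /Kids [ 3 0 R ] /Count 1 >>
tokens = [
    ("DOUBLE_LESS_THAN_SIGN", "<<"),
    ("NAME", "Type"), ("NAME", "Pages"),
    ("NAME", "Kids"),
    ("LEFT_SQUARE_BRACKET", "["), ("R", (3, 0)), ("RIGHT_SQUARE_BRACKET", "]"),
    ("NAME", "Count"), ("NUMBER", 1),
    ("DOUBLE_GREATER_THAN_SIGN", ">>"),
]
parsed, next_pos = parse_object(tokens)
print(parsed)
```

The right-recursive list rules (dictionary_entry_list, object_list) become plain loops here, so deeply nested recursion only happens for genuinely nested dictionaries and arrays.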
f/

For some small subset of PDFs, this approach will work. But as you noted, it requires a bunch of “exceptions” that would be avoided entirely if you used the xrefs as they are meant to be used.
I’m still waiting for the spec quote where it says that it accepts things outside that grammar.
a.k.a. “garbage between objects” or “overlapped objects”.
In practice... well, yes, you would need to handle out-of-spec bits anyway.
[...] Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here.. Besides the straight [...]