PDF sequential parsing
August 22, 2010
As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern: the format won’t let you scan it properly. The size of a PDF stream is hard to determine at scanner/lexer time. I’ve suggested escaping the “endstream” keyword as a solution. Other patches have also emerged, like forcing the /Length entry to be direct, or calculating every object’s size from the XREF pointers (assuming there is no garbage between the objects, which is in fact what the spec says).
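To make the XREF idea concrete, here is a minimal sketch in Python. It assumes the xref offsets have already been read into a dict, and that objects are laid out back to back with no garbage in between. The function name `object_extents` and the `body_end` parameter are made up for illustration.

```python
def object_extents(offsets, body_end):
    """Map each object number to a (start, end) byte range.

    offsets:  dict of object number -> byte offset taken from the xref
              (assumed to be already parsed; parsing is not shown here).
    body_end: offset where the body stops, e.g. where the xref section
              begins (an assumption of this sketch).
    """
    # Sort objects by where they sit in the file, not by object number.
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    extents = {}
    for i, (number, start) in enumerate(ordered):
        # An object ends where the next one starts; the last one ends
        # at body_end. This only holds with no garbage between objects.
        if i + 1 < len(ordered):
            end = ordered[i + 1][1]
        else:
            end = body_end
        extents[number] = (start, end)
    return extents


# Object 3 sits between objects 1 and 2 in the file.
extents = object_extents({1: 15, 2: 80, 3: 40}, body_end=120)
```

With the extent of each object known up front, the lexer no longer has to guess where a stream ends: it can be handed exactly the bytes between `start` and `end`.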
Well, in any case, if you manage to run a lexer and tokenize it, here you have the parsing grammar … weeee!!
object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array
dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN
dictionary_entry_list : NAME object dictionary_entry_list | empty
array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET
object_list : object object_list | empty
indirect : indirect_object_stream | indirect_object
indirect_object : OBJ object ENDOBJ
indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ
xref : indirect_object_stream | XREF TRAILER dictionary
pdf : HEADER pdf_update_list
pdf_update_list : pdf_update_list body xref pdf_end | body xref pdf_end
body : body indirect_object | body indirect_object_stream | empty
pdf_end : STARTXREF EOF
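As a rough illustration of the object, dictionary and array rules, here is a hand-written recursive-descent sketch in Python. It is not the parser used in this post. It assumes the lexer has already produced a list of (token_type, value) tuples named after the grammar terminals, and that a whole indirect reference such as "3 0 R" arrives as a single R token. The function names are made up for the example.

```python
def parse_object(tokens, pos=0):
    """Parse one `object` at tokens[pos]; return (value, next_pos)."""
    kind, value = tokens[pos]
    if kind in ("NAME", "STRING", "HEXSTRING", "NUMBER", "R"):
        return value, pos + 1          # terminals carry their own value
    if kind == "TRUE":
        return True, pos + 1
    if kind == "FALSE":
        return False, pos + 1
    if kind == "NULL":
        return None, pos + 1
    if kind == "DOUBLE_LESS_THAN_SIGN":
        return parse_dictionary(tokens, pos)
    if kind == "LEFT_SQUARE_BRACKET":
        return parse_array(tokens, pos)
    raise SyntaxError(f"unexpected token {kind} at index {pos}")


def parse_dictionary(tokens, pos):
    """dictionary : << dictionary_entry_list >>"""
    pos += 1                           # skip the opening <<
    entries = {}
    while tokens[pos][0] != "DOUBLE_GREATER_THAN_SIGN":
        kind, key = tokens[pos]
        if kind != "NAME":             # every entry is NAME object
            raise SyntaxError(f"expected NAME at index {pos}, got {kind}")
        value, pos = parse_object(tokens, pos + 1)
        entries[key] = value
    return entries, pos + 1            # skip the closing >>


def parse_array(tokens, pos):
    """array : [ object_list ]"""
    pos += 1                           # skip the opening [
    items = []
    while tokens[pos][0] != "RIGHT_SQUARE_BRACKET":
        value, pos = parse_object(tokens, pos)
        items.append(value)
    return items, pos + 1              # skip the closing ]


# Hypothetical token list for: << /Type /Catalog /Kids [ 3 0 R 42 ] >>
tokens = [
    ("DOUBLE_LESS_THAN_SIGN", "<<"),
    ("NAME", "Type"), ("NAME", "Catalog"),
    ("NAME", "Kids"),
    ("LEFT_SQUARE_BRACKET", "["),
    ("R", (3, 0)), ("NUMBER", 42),
    ("RIGHT_SQUARE_BRACKET", "]"),
    ("DOUBLE_GREATER_THAN_SIGN", ">>"),
]
value, next_pos = parse_object(tokens)
```

The recursive list rules (dictionary_entry_list, object_list) become plain loops here. A truncated token list ends in an IndexError rather than a clean error, which a real parser would want to handle.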
f/
For some small subset of PDFs, this approach will work. But as you noted, it requires a bunch of “exceptions” that would be avoided entirely if you used the xrefs as they are meant to be used.
I’m still waiting for the spec quote where it says that it accepts things outside that grammar.
a.k.a “garbage between objects” or “overlapped objects”
In practice... well, yes, you would need to handle out-of-spec bits anyway.
[…] Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here.. Besides the straight […]