PDF sequential parsing
August 22, 2010
As discussed in earlier posts, the problem with PDF is that we cannot apply an out-of-the-box scanner/parser design pattern: the format won’t let you scan it properly. The size of a PDF stream is hard to determine at scanner/lexer time. I’ve suggested the solution of escaping the “endstream” keyword. Other patches have also emerged, like forcing the /Length entry to be a direct object, or calculating every object's size from the XREF pointers (assuming there is no garbage between the objects, which in fact is what the spec says).
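The XREF-pointer workaround can be sketched in a few lines. This is only an illustration under the assumption stated above (objects packed back to back, no garbage in between). The function name `object_sizes` and its parameters are made up for this example, and `end_offset` is assumed to be the byte offset where the body ends (for instance, where the xref section starts).

```python
def object_sizes(xref_offsets, end_offset):
    """Map each object's start offset to its size in bytes.

    Assumes objects are contiguous: every object ends exactly where
    the next one starts, and the last one ends at end_offset.
    """
    starts = sorted(set(xref_offsets))
    bounds = starts + [end_offset]
    # Size of object i is the distance to the next boundary.
    return {start: bounds[i + 1] - start for i, start in enumerate(starts)}


# Hypothetical offsets, as they might be read from an xref table.
sizes = object_sizes([15, 120, 64], end_offset=300)
```

With the sizes known up front, a lexer could slice out each object (stream included) without having to search for “endstream”.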
Well, in any case, if you manage to run a lexer and tokenize the file, here you have the parsing grammar … weeee!!
object : NAME | STRING | HEXSTRING | NUMBER | TRUE | FALSE | NULL | R | dictionary | array

dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN

dictionary_entry_list : NAME object dictionary_entry_list | empty

array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET

object_list : object object_list | empty

indirect : indirect_object_stream | indirect_object

indirect_object : OBJ object ENDOBJ

indirect_object_stream : OBJ dictionary STREAM_DATA ENDOBJ

xref : indirect_object_stream | XREF TRAILER dictionary

pdf : HEADER pdf_update_list

pdf_update_list : pdf_update_list body xref pdf_end | body xref pdf_end

body : body indirect_object | body indirect_object_stream | empty

pdf_end : STARTXREF EOF
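To show how the `object`, `dictionary` and `array` rules play out, here is a minimal hand-written recursive-descent sketch. It is not a parser generated from the grammar, just a small stand-in for those three rules. It assumes the lexer yields `(type, value)` tuples using the token names above, and that an `R` token already carries the whole reference as a `(number, generation)` pair. Both are assumptions of this example.

```python
SCALARS = {"NAME", "STRING", "HEXSTRING", "NUMBER", "TRUE", "FALSE", "NULL", "R"}


def parse_object(tokens, pos=0):
    """Parse one `object` at tokens[pos]; return (value, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of tokens")
    kind, value = tokens[pos]
    if kind in SCALARS:
        return value, pos + 1
    if kind == "DOUBLE_LESS_THAN_SIGN":
        return parse_dictionary(tokens, pos)
    if kind == "LEFT_SQUARE_BRACKET":
        return parse_array(tokens, pos)
    raise SyntaxError(f"unexpected token {kind} at index {pos}")


def parse_dictionary(tokens, pos):
    """dictionary : DOUBLE_LESS_THAN_SIGN dictionary_entry_list DOUBLE_GREATER_THAN_SIGN"""
    pos += 1  # skip the opening "<<"
    entries = {}
    # dictionary_entry_list : NAME object dictionary_entry_list | empty
    while pos < len(tokens) and tokens[pos][0] == "NAME":
        key = tokens[pos][1]
        value, pos = parse_object(tokens, pos + 1)
        entries[key] = value
    if pos >= len(tokens) or tokens[pos][0] != "DOUBLE_GREATER_THAN_SIGN":
        raise SyntaxError(f"expected '>>' at index {pos}")
    return entries, pos + 1


def parse_array(tokens, pos):
    """array : LEFT_SQUARE_BRACKET object_list RIGHT_SQUARE_BRACKET"""
    pos += 1  # skip the opening "["
    items = []
    # object_list : object object_list | empty
    while pos < len(tokens) and tokens[pos][0] != "RIGHT_SQUARE_BRACKET":
        value, pos = parse_object(tokens, pos)
        items.append(value)
    if pos >= len(tokens):
        raise SyntaxError("unterminated array")
    return items, pos + 1


# Hypothetical token stream for: << /Type /Catalog /Kids [ 3 0 R ] >>
tokens = [
    ("DOUBLE_LESS_THAN_SIGN", "<<"), ("NAME", "Type"), ("NAME", "Catalog"),
    ("NAME", "Kids"), ("LEFT_SQUARE_BRACKET", "["), ("R", (3, 0)),
    ("RIGHT_SQUARE_BRACKET", "]"), ("DOUBLE_GREATER_THAN_SIGN", ">>"),
]
parsed, next_pos = parse_object(tokens)
```

The right-recursive list rules (`dictionary_entry_list`, `object_list`) become plain `while` loops here. The left-recursive `body` and `pdf_update_list` rules would need the same loop treatment in a hand-written parser, whereas an LALR-style generator can take them as written.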