August 14, 2010
(Or why I can’t parse a PDF)
This post is about the difficulties I ran into when trying to write a PDF parser. It’s my opinion that
the PDF specification is broken because it permits the token “endstream” inside a stream!
There are 4 ways of deciding the size of a PDF stream:
[+] Scanning for the “endstream” token
[+] Getting the size from a direct /Length entry
[+] Getting the size from an indirect /Length entry, resolved through the normal xref
[+] Calculating the size from the object start offsets that the normal cross-reference points to
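To make the disagreement concrete, here is a minimal Python sketch of the first two approaches. The function names and the sample object are made up for illustration, and the regexes are deliberate simplifications: they ignore comments, strings inside the dictionary, and filters, and they treat an indirect /Length as "not found".

```python
import re

# Simplified patterns (assumptions for this sketch, not a full PDF lexer).
STREAM_KW = re.compile(rb"stream\r?\n")
# A direct /Length is a bare integer; "/Length 12 0 R" (indirect) is rejected.
DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)(?!\d|\s+\d+\s+R)")


def data_start(obj: bytes) -> int:
    """Offset of the first stream byte, right after the 'stream' keyword and EOL."""
    m = STREAM_KW.search(obj)
    if m is None:
        raise ValueError("no stream keyword found")
    return m.end()


def size_by_scanning(obj: bytes) -> int:
    """Approach 1: stop at the first 'endstream' token after the stream start."""
    start = data_start(obj)
    end = obj.find(b"endstream", start)
    if end < 0:
        raise ValueError("no endstream token found")
    return end - start


def size_by_direct_length(obj: bytes):
    """Approach 2: trust the direct /Length value in the stream dictionary."""
    header = obj[:data_start(obj)]
    m = DIRECT_LENGTH.search(header)
    return int(m.group(1)) if m else None


# A stream whose payload happens to contain the token "endstream".
payload = b"abc endstream def"
obj = (b"1 0 obj\n<< /Length 17 >>\nstream\n"
       + payload + b"\nendstream\nendobj\n")

print(size_by_scanning(obj), size_by_direct_length(obj))
```

On this sample the scanner stops at the embedded token and reports 4 bytes, while /Length claims 17, so a parser has to decide which one to believe.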
What happens in actual PDF implementations if:
[+] The cross-reference is broken?
[+] The cross-reference points to overlapping objects?
[+] A stream contains the endstream token?
[+] A stream contains some evil endstream/endobj token combination?
[+] All 4 (or more) ways of sizing a PDF stream are available: should they all be consistent?
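One way a parser could probe the consistency question is to compare the direct /Length against the room the cross-reference leaves for each object. The sketch below is only an illustration under assumptions: `offsets` is a hypothetical dict of object number to byte offset (as if already read from the xref), `end_of_body` is a hypothetical upper bound for the last object, and the regexes are the same kind of simplification as any quick scanner would use.

```python
import re

# Simplified patterns (assumptions for this sketch, not a full PDF lexer).
STREAM_KW = re.compile(rb"stream\r?\n")
DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)(?!\d|\s+\d+\s+R)")


def bounds_from_xref(offsets, end_of_body):
    """Give every object the byte range [start, start of the next object)."""
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    bounds = {}
    for i, (num, start) in enumerate(ordered):
        stop = ordered[i + 1][1] if i + 1 < len(ordered) else end_of_body
        bounds[num] = (start, stop)
    return bounds


def length_overruns(data, offsets, end_of_body):
    """Object numbers whose direct /Length reaches past their xref-derived bound."""
    bad = []
    for num, (start, stop) in bounds_from_xref(offsets, end_of_body).items():
        kw = STREAM_KW.search(data, start, stop)
        if kw is None:
            continue  # not a stream object, or the keyword lies outside the bound
        m = DIRECT_LENGTH.search(data, start, kw.start())
        if m and kw.end() + int(m.group(1)) > stop:
            bad.append(num)
    return bad


# Object 1 lies about its length; object 2 is honest.
obj1 = b"1 0 obj\n<< /Length 500 >>\nstream\nxx\nendstream\nendobj\n"
obj2 = b"2 0 obj\n<< /Length 2 >>\nstream\nyy\nendstream\nendobj\n"
data = obj1 + obj2
offsets = {1: 0, 2: len(obj1)}

print(length_overruns(data, offsets, len(data)))
```

This only flags the disagreement. What a reader should do next (trust /Length, trust the xref, or fall back to scanning) is exactly the open question.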
And finally, is this file PDF compliant? I bet someone could construct an obfuscation method based on these “issues”.
If you still think this is worth reading, check out the following details, and please comment if you find a bug or have a solution for the problems I stated here.