PDF, A broken Spec!
August 14, 2010
(Or why I can’t parse a PDF)
This post is about the difficulties I ran into when trying to write a PDF parser. It’s my opinion that
the PDF specification is broken because it permits the token “endstream” inside a stream!
There are 4 ways of deciding the size of a PDF stream:
 Scanning for the endstream token
 Get the size from the direct /Length entry
 Get the size from the indirect /Length entry, resolved through the normal xref
 Calculate the size from the object start offsets listed in the normal cross-reference
What happens in actual PDF implementations if:
[+] The cross-reference is broken?
[+] The cross-reference points to overlapping objects?
[+] A stream contains the endstream token?
[+] A stream contains some evil endstream/endobj token combination?
[+] All 4 (or more) ways of parsing a PDF stream are present? Should they all be consistent?
And finally, is this file PDF compliant? I bet someone could construct an obfuscation method based on these “issues”.
If you still think this is worth reading, check out the following details, and please comment if you find a bug or have a solution for the problems I state here.
A PDF stream must be an indirect object. An indirect object is a PDF object enclosed between the keywords obj and endobj. If the indirect object “100 0 obj 1234567789 endobj” happens to be in your PDF,
then any reference of the form “100 0 R” appearing in the PDF will reference the number 1234567789. Everything seems clean for indirect numbers and the other basic types like strings, arrays and even dictionaries. The problem arises with PDF streams.
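To make the lookup concrete, here is a minimal Python sketch of collecting indirect objects and resolving a reference to one of them. The helper names (find_indirect_objects, resolve_reference) and the sample bytes are my own illustration, not anything from the spec, and the regex deliberately takes the naive "scan to the next endobj" view that the rest of this post shows to be unsafe for streams.

```python
import re

def find_indirect_objects(data: bytes) -> dict:
    """Map (object number, generation) to the raw bytes between obj and endobj.

    Naive on purpose: it stops at the first endobj it sees.
    """
    objects = {}
    pattern = rb"(\d+)\s+(\d+)\s+obj\s+(.*?)\s*endobj"
    for m in re.finditer(pattern, data, re.DOTALL):
        objects[(int(m.group(1)), int(m.group(2)))] = m.group(3)
    return objects

def resolve_reference(ref: bytes, objects: dict) -> bytes:
    """Resolve a reference written as b'N G R' against the collected objects."""
    m = re.fullmatch(rb"\s*(\d+)\s+(\d+)\s+R\s*", ref)
    if m is None:
        raise ValueError("not an indirect reference")
    return objects[(int(m.group(1)), int(m.group(2)))]

# Hypothetical sample: the indirect number discussed above.
sample = b"100 0 obj\n1234567789\nendobj\n"
objects = find_indirect_objects(sample)
value = resolve_reference(b"100 0 R", objects)
```

For simple types like this number the approach works fine; the trouble only starts once the object body is allowed to contain arbitrary bytes.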
A stream object, like a string object, is a sequence of bytes. A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream(followed by newline) and endstream.
A stream will look like this…
 Scan for the next endstream
The first naive approach when parsing PDF stream is to consume the dictionary …
… then check if you have a stream keyword and scan until you get an endstream
But what if, for some reason, you want to have the string “endstream” inside the PDF stream? Well, something will obviously go wrong. Just try to naive-parse the following stream (which contains the endstream string inside its payload):
You’ll get a stream shorter than it should be, followed by some binary garbage left outside the stream limits.
That’s wrong by specification. A PDF stream MUST be an indirect object, so it MUST also be enclosed between the N M obj and endobj tokens, like this:
Interesting, but that’s not going to fix the problem, because we can also put the endobj keyword inside the binary stream. In fact we can simulate a complete trailing PDF structure inside the stream. Try to parse this by hand (ignore the /Length for now)…
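The truncation is easy to reproduce. The following is a small Python sketch of the naive scan; the function name and the sample object are made up for illustration, and it assumes the word "stream" does not appear in the dictionary itself.

```python
def naive_stream_payload(data: bytes) -> bytes:
    """Return the bytes between the stream keyword and the FIRST endstream."""
    start = data.index(b"stream") + len(b"stream")
    # Skip the end-of-line marker that follows the stream keyword.
    if data[start:start + 2] == b"\r\n":
        start += 2
    elif data[start:start + 1] == b"\n":
        start += 1
    end = data.index(b"endstream", start)
    return data[start:end]

# Hypothetical stream whose payload contains the keyword itself.
payload = b"AAAA endstream BBBB"
pdf_object = (
    b"<< /Length " + str(len(payload)).encode() + b" >>\n"
    b"stream\n" + payload + b"\nendstream"
)

truncated = naive_stream_payload(pdf_object)
leftover = pdf_object[pdf_object.index(truncated) + len(truncated):]
```

Here truncated holds only the bytes before the embedded keyword, and leftover is the part of the real payload that a naive parser would then try to interpret as PDF syntax.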
It should be interpreted as a stream containing this binary payload…
NOTE: poppler, xpdf, and Adobe parse it correctly despite the embedded “endstream”.
Yeah right. The only thing that becomes clear here is that we cannot rely “only” on the appearance of the stream, endstream, obj and endobj keywords. We need something else.
 The mandatory /Length keyword.
Each stream object MUST have a /Length keyword in its dictionary to resolve the ambiguities and speed up the scanning process. The /Length value must be a number indicating the number of bytes in the stream. If we know the length, we can “seek” to just past the end of the stream payload and simply check for the existence of the endstream keyword.
Caveat 1: What happens when there is no endstream keyword where one is supposed to be?
Caveat 2: To make PDF files easier to produce, the spec lets the /Length value be an indirect reference to a number. That’s very useful when producing a PDF stream: you can postpone setting the length until you have already put the (potentially compressed) stream of bytes in place, and only then write out the size.
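A rough Python sketch of that seek-and-check idea follows. It is only an illustration under simplifying assumptions of mine: the /Length is a direct number, it is the last entry of the dictionary, and the function name is invented. A real parser would tokenize the dictionary properly.

```python
import re

def read_stream_with_length(data: bytes) -> bytes:
    """Read a stream payload using a direct /Length, then verify endstream."""
    # Assumption: "/Length N" is the last dictionary entry before ">>".
    m = re.search(rb"/Length\s+(\d+)\s*>>\s*stream(\r\n|\n)", data)
    if m is None:
        raise ValueError("no direct /Length followed by a stream keyword")
    length = int(m.group(1))
    start = m.end()
    payload = data[start:start + length]
    # Skip the end-of-line marker after the payload and expect endstream.
    rest = data[start + length:].lstrip(b"\r\n")
    if not rest.startswith(b"endstream"):
        raise ValueError("endstream is not where /Length says it should be")
    return payload

# Hypothetical stream with the keyword embedded in its payload.
payload = b"AAAA endstream BBBB"
pdf_object = (
    b"<< /Length " + str(len(payload)).encode() + b" >>\n"
    b"stream\n" + payload + b"\nendstream"
)
recovered = read_stream_with_length(pdf_object)
```

With a correct /Length the embedded keyword is harmless. With a wrong one this sketch simply raises, which is one possible policy among several; what a reader should do in that case is exactly the open question.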
[+] Put a reference to a not yet defined length in the dictionary
[+] Put the dictionary
[+] Produce the stream
[+] Set the length in the referenced indirect object
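The steps above can be sketched from the writer's side in Python. The function name, the object numbers and the choice of FlateDecode via zlib are assumptions of mine for the sake of the example; the point is only that the length is measured after the compressed bytes are already written.

```python
import io
import zlib

def write_stream_with_indirect_length(out, obj_num, len_obj_num, chunks):
    """Write a compressed stream whose /Length lives in a later indirect object."""
    offsets = {obj_num: out.tell()}
    # Steps 1 and 2: emit the dictionary with a reference to a not yet written length.
    out.write(
        b"%d 0 obj\n<< /Filter /FlateDecode /Length %d 0 R >>\nstream\n"
        % (obj_num, len_obj_num)
    )
    # Step 3: produce the stream, compressing chunk by chunk.
    start = out.tell()
    compressor = zlib.compressobj()
    for chunk in chunks:
        out.write(compressor.compress(chunk))
    out.write(compressor.flush())
    length = out.tell() - start
    out.write(b"\nendstream\nendobj\n")
    # Step 4: only now do we know the size, so write the length object.
    offsets[len_obj_num] = out.tell()
    out.write(b"%d 0 obj\n%d\nendobj\n" % (len_obj_num, length))
    return offsets, length

buf = io.BytesIO()
offsets, length = write_stream_with_indirect_length(buf, 7, 8, [b"hello ", b"world"])
pdf_bytes = buf.getvalue()
```

Convenient for the producer, but note what it does to the reader: the length object sits after the stream it describes.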
So, for parsing a stream object we need to get another indirect object. Indirect objects are defined with the obj and endobj keywords. But obj and endobj could appear inside a stream too. Deadlock? Or is there another hidden card in the spec?
 The Normal Cross reference.
The PDF cross reference is the fastest way to know where a given indirect PDF object starts! It comes in two flavours: a normal XREF and a stream XREF.
But first we need to find where the XREF is placed. That is done with the help of the startxref keyword. This keyword must appear almost at the end of the file and point to the byte position of the cross-reference section, which is followed by the trailer. Check out section 7.5.5 of the spec (PDF3200::7.5.5) for more detail. A PDF should end with the startxref keyword, the offset on the following line, and the %%EOF marker on the last line.
The spec suggests that conforming readers should read a PDF file from its end. Once you have the cross reference, you know where the different indirect objects start. Also, if you assume every cross-referenced position points to exactly one well-defined object, you can, after some calculation, determine the size of every object. This is the last of the 4 ways of determining a PDF stream length. What happens if it doesn’t match the others?
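Locating that offset from the tail of the file could look like the Python sketch below. The function name, the 1024-byte tail window and the toy file are assumptions of mine (the sample is not a complete valid PDF), and taking the last match is just one reasonable policy.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the offset that follows the last startxref near the end of the file."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found near the end of the file")
    return int(matches[-1])

# Hypothetical, heavily abbreviated file: one object, a tiny xref and a trailer.
body = b"%PDF-1.4\n1 0 obj\n42\nendobj\n"
xref_offset = len(body)
sample = (
    body
    + b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(xref_offset).encode() + b"\n%%EOF\n"
)
found = find_startxref(sample)
```

In the toy file the returned offset lands exactly on the xref keyword. Nothing stops a stream payload near the end of a file from containing a fake startxref, which is why the sketch only trusts the last one it sees.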
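The "some calculation" can be as simple as sorting the offsets and subtracting neighbours. Here is a Python sketch under the same optimistic assumption as above, namely that each offset marks the start of exactly one object and that objects are packed back to back; the function name and the numbers are invented for illustration.

```python
from typing import Dict

def sizes_from_offsets(offsets: Dict[int, int], body_end: int) -> Dict[int, int]:
    """Estimate each object's size as the gap to the next object's offset.

    offsets maps object number to byte offset; body_end is where the last
    object is assumed to stop (for example the position of the xref keyword).
    """
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    next_starts = [offset for _, offset in ordered[1:]] + [body_end]
    return {
        num: next_start - start
        for (num, start), next_start in zip(ordered, next_starts)
    }

# Hypothetical xref contents: objects need not be stored in numeric order.
sizes = sizes_from_offsets({1: 9, 2: 50, 3: 30}, body_end=120)
```

The result is an upper bound on each object, including the obj and endobj wrappers and any padding. Two entries pointing at the same offset would yield a size of zero, which is one cheap hint that the cross-reference is lying.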
 The Cross Reference Stream.
There are also cross-reference streams. Cross-reference streams are stream objects, and contain a dictionary and a data stream. Each cross-reference stream contains the information equivalent to the cross-reference table and trailer for one cross-reference section.
The value following the startxref keyword shall be the offset of the cross-reference stream rather than the xref keyword. For files that use cross-reference streams entirely, the keywords xref and trailer shall no longer be used. Therefore, with the exception of the startxref address %%EOF segment and comments, a file may be entirely a sequence of objects.
So there is a way, the modern way, to hold cross references in potentially compressed PDF streams in the middle of the file. How do we parse this PDF stream? We don’t have the cross-reference trick for getting the length of this stream, so we could use the buggy scan-to-the-next-endstream way or the /Length way. But is the /Length entry in the cross-reference stream indirect? The spec requires that some of the entries in the cross-reference stream dictionary not be indirect, but not the /Length. OK, timeout. Head-about-to-explode alert, hurn hurn!!
The Linearized hell.
More research needs to be done on this one. We’ll just quote a bit of the spec on this matter…
“For pedagogical reasons the linearized PDF is considered to be composed from 11 parts…”