miniPDF.py: reinventing the wheel of PDF evilness
January 12, 2010
Last year I coded a mini PDF rendering library from scratch, mostly as a way to go through the whole PDF spec and learn something about it. Nowadays you’ll find other, probably better, options for managing PDFs programmatically, like pyPDF or PDF::Writer. Anyway, mine is here. (It was also presented at uCon in Feb 2009. Slides here.)
It supports only the most basic file structure (ISO 32000-1, section 7.5.1), that is, without incremental updates or linearization:
• A one-line header identifying the version of the PDF specification to which the file conforms
• A body containing the objects that make up the document contained in the file
• A cross-reference table containing information about the indirect objects in the file
• A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
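The four parts listed above can be sketched in a few lines of plain Python. This is not miniPDF's code: `build_pdf` and its parameters are hypothetical names, and the objects are passed in already serialized. It only illustrates how the byte offsets recorded in the cross-reference table and the `startxref` value tie the pieces together.

```python
def build_pdf(objects, root_num, version="1.7"):
    """Assemble header, body, cross-reference table and trailer.

    `objects` is a list of already-serialized object bodies (bytes).
    The object at index i gets object number i + 1 and generation 0.
    """
    # Header: one line naming the spec version.
    out = bytearray(b"%PDF-" + version.encode("ascii") + b"\n")

    # Body: write each object and remember the byte offset where it starts.
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    # Cross-reference table: entry 0 is the head of the free list,
    # followed by one 20-byte entry per object.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off

    # Trailer: table size, root object, and where the table starts.
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (len(objects) + 1, root_num)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf = build_pdf(
    [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ],
    root_num=1,
)
```

The result has the right skeleton but no pages, so a viewer would have nothing to show. The point is only the layout of the file.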
It also supports all the base PDF types: null, object references, strings, numbers, arrays and dictionaries.
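To give an idea of how these types look on the wire, here is a rough sketch that maps Python values to PDF syntax. The `Ref` class and the `serialize` function are made up for this example and are not miniPDF's API. Dictionary keys are written as PDF names, and Python strings always become literal strings, so name values such as /Catalog are not covered here.

```python
class Ref:
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, num, gen=0):
        self.num, self.gen = num, gen


def serialize(value):
    """Return the PDF text for a Python value (illustrative only)."""
    if value is None:
        return "null"
    if isinstance(value, Ref):
        return "%d %d R" % (value.num, value.gen)
    if isinstance(value, bool):
        # bool is a subclass of int, so it must be handled before numbers.
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # PDF has no exponent notation, so avoid repr() for floats.
        return ("%.6f" % value).rstrip("0").rstrip(".")
    if isinstance(value, str):
        # Literal string: escape the backslash first, then parentheses.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(serialize(v) for v in value) + "]"
    if isinstance(value, dict):
        items = " ".join("/%s %s" % (k, serialize(v)) for k, v in value.items())
        return "<< " + items + " >>"
    raise TypeError("no PDF representation for %r" % (value,))


print(serialize({"Kids": [Ref(3)], "Count": 1, "Title": "Hi (there)"}))
```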
A minimal text-displaying PDF will have this structure (BUG: the resources dictionary should be hanging from a page, I think)…
and will look like this in Python.
First we import the lib and create a PDFDoc object representing a document in memory …
As shown in the last figure, the main object is the Catalog. The next three lines construct a Catalog dictionary object, add it to the document and set it as the root object…
And the Pages dictionary…
At this point we don’t even have a valid PDF, but this is what the output will look like…
1 0 obj
0000000000 65535 f
0000000015 00000 n
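The last two lines of that output are cross-reference entries: a 10-digit number, a 5-digit generation and a one-letter flag. For an in-use entry (`n`) the number is the byte offset of the object, and for a free entry (`f`) it is the object number of the next free object. A small sketch of reading them back, with `parse_xref_entry` being a name invented for this example:

```python
def parse_xref_entry(line):
    """Split one cross-reference entry into (number, generation, in_use).

    For in-use entries the number is a byte offset into the file.
    For free entries it is the object number of the next free object.
    """
    number, generation, flag = line.split()
    if flag not in ("n", "f"):
        raise ValueError("unknown xref flag: %r" % flag)
    return int(number), int(generation), flag == "n"


print(parse_xref_entry("0000000000 65535 f"))
print(parse_xref_entry("0000000015 00000 n"))
```

So the first entry is the head of the free list (generation 65535), and the second says that object 1 starts at byte 15 of the file.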
Let’s add a page with some content. First we add the contents stream to the document…
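For reference, a contents stream that shows a line of text is just a handful of operators wrapped in a stream object. The sketch below is independent of miniPDF: `text_stream` is a hypothetical helper, and the font size and position are arbitrary values picked for the example. The only thing taken from this post is the font name /F1.

```python
def text_stream(text, font="F1", size=24, x=100, y=700):
    """Build the body of a stream object that draws `text` once."""
    # Escape the characters that are special inside a literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # BT/ET delimit a text object, Tf selects the font and size,
    # Td moves to the start position, Tj shows the string.
    ops = "BT /%s %d Tf %d %d Td (%s) Tj ET" % (font, size, x, y, escaped)
    data = ops.encode("latin-1")
    # /Length must be the exact number of bytes between the stream keywords.
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


print(text_stream("Hello (world)").decode("latin-1"))
```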
And then the page, referencing the already added contents…
And link the page in the Pages dictionary …
But we have used a font named /F1 in the contents stream, so we need to add it to the doc…
Then map it in the font-name map dictionary…
And finally link the font in the page resources dictionary…
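With that last link in place, every object should be reachable from the Catalog. The sketch below models the object graph with plain Python dicts and checks exactly that. It does not use miniPDF: the object numbers, the `("ref", n)` convention, the Helvetica font and the helper names are all assumptions made for the illustration.

```python
def refs_in(value):
    """Yield every object number referenced inside a nested value."""
    if isinstance(value, tuple) and len(value) == 2 and value[0] == "ref":
        yield value[1]
    elif isinstance(value, dict):
        for v in value.values():
            yield from refs_in(v)
    elif isinstance(value, list):
        for v in value:
            yield from refs_in(v)


def reachable(objects, root):
    """Return the set of object numbers reachable from `root`."""
    seen, todo = set(), [root]
    while todo:
        num = todo.pop()
        if num in seen:
            continue  # already visited, e.g. through a /Parent back link
        seen.add(num)
        todo.extend(refs_in(objects[num]))  # KeyError means a dangling reference
    return seen


objects = {
    1: {"Type": "/Catalog", "Pages": ("ref", 2)},
    2: {"Type": "/Pages", "Kids": [("ref", 4)], "Count": 1},
    3: {"stream": "BT /F1 24 Tf 100 700 Td (Hello) Tj ET"},
    4: {"Type": "/Page", "Parent": ("ref", 2),
        "Contents": ("ref", 3), "Resources": ("ref", 7)},
    5: {"Type": "/Font", "Subtype": "/Type1", "BaseFont": "/Helvetica"},
    6: {"F1": ("ref", 5)},      # font-name map
    7: {"Font": ("ref", 6)},    # resources dictionary
}

orphans = set(objects) - reachable(objects, 1)
print("orphaned objects:", sorted(orphans))
```

Here the resources dictionary hangs from the page. Removing the `Resources` entry from object 4 would leave objects 5, 6 and 7 orphaned, which is the kind of mistake this check is meant to catch.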
That’s it. Let’s print it to stdout…