miniPDF.py: reinventing the wheel of PDF evilness
January 12, 2010
Last year I coded a mini PDF rendering library from scratch, mostly as a way to go through the whole PDF spec and learn something about it. Nowadays you’ll find other, probably better, options for managing PDF programmatically, like pyPDF or PDF::Writer. Anyway, mine is here. (It was also presented at uCon in Feb 2009. Slides here.)
It supports only the most basic file structure (PDF32000:7.5.1), that is, without incremental updates or linearization:
• A one-line header identifying the version of the PDF specification to which the file conforms
• A body containing the objects that make up the document contained in the file
• A cross-reference table containing information about the indirect objects in the file
• A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
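To make those four parts concrete, here is a minimal sketch in plain Python that does not use miniPDF at all. The function name `build_pdf` and its arguments are made up for illustration, and it assumes the object bodies are already serialized and that every object has generation 0.

```python
def build_pdf(objects, root_num=1):
    """Lay out header, body, xref table and trailer around `objects`.

    `objects` is a list of bytes; item i becomes indirect object i+1.
    This is an illustrative helper, not miniPDF's API.
    """
    out = bytearray(b"%PDF-1.3\n")                 # 1. header
    offsets = []
    for num, body in enumerate(objects, start=1):  # 2. body
        offsets.append(len(out))                   # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)                            # 3. cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                # entry 0 is always free
    for off in offsets:
        out += b"%010d 00000 n \n" % off           # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (len(objects) + 1, root_num)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos  # 4. trailer points at xref
    return bytes(out)


if __name__ == "__main__":
    catalog = b"<< /Type /Catalog /Pages 2 0 R >>"
    pages = b"<< /Type /Pages /Kids [] /Count 0 >>"
    print(build_pdf([catalog, pages]).decode("latin-1"))
```

The only tricky part is bookkeeping: every offset in the xref table has to match the exact byte position of its object, which is why the sketch records `len(out)` before writing each one.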
It also supports all the base PDF types: null, object references, strings, numbers, arrays and dictionaries.
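As a rough idea of how those types look once written to a file, here is a small sketch that maps Python values to PDF syntax. The `Ref` class and `to_pdf` function are hypothetical names for this example and are not how miniPDF models its types.

```python
class Ref:
    """Illustrative stand-in for an indirect reference such as `3 0 R`."""

    def __init__(self, num, gen=0):
        self.num = num
        self.gen = gen


def to_pdf(value):
    """Serialize a Python value using PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, Ref):
        return "%d %d R" % (value.num, value.gen)
    if isinstance(value, bool):
        # Extra to the list above: PDF booleans are written true/false.
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # PDF has no exponent notation, so use fixed point and trim zeros.
        return ("%.6f" % value).rstrip("0").rstrip(".")
    if isinstance(value, str):
        # Literal string: backslash and parentheses must be escaped.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        # Dictionary keys are written as names (/Key).
        items = " ".join("/%s %s" % (k, to_pdf(v)) for k, v in value.items())
        return "<< " + items + " >>"
    raise TypeError("unsupported type: %r" % type(value))


if __name__ == "__main__":
    print(to_pdf({"Count": 1, "Kids": [Ref(3)], "Title": "a (tiny) test"}))
```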
A minimal text-displaying PDF will have this structure (BUG: the resources dictionary should be hanging from a page, I think)…
and will look like this in Python.
First we import the lib and create a PDFDoc object representing a document in memory …
doc = PDFDoc()
As shown in the last figure the main object is the Catalog. The next 3 lines construct a Catalog Dictionary object, add it to the document and set it as the root object…
doc.add(catalog)
doc.setRoot(catalog)
And the Pages dictionary…
catalog.add("Pages", PDFRef(pages))
doc.add(pages)
At this point we don’t even have a valid PDF, but this is what the output will look like…
%PDF-1.3
%....
1 0 obj
<<<>> >>
endobj
xref
0 2
0000000000 65535 f
0000000015 00000 n
trailer
<>
startxref
78
%%EOF
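Each entry in the xref section above follows a fixed layout: a 10-digit number, a 5-digit generation number, and `n` (in use) or `f` (free). For in-use entries the first number is the byte offset of the object from the start of the file. A small sketch for reading one entry follows; `parse_xref_entry` is a hypothetical helper, not part of miniPDF.

```python
def parse_xref_entry(line):
    """Split one xref entry into (number, generation, in_use).

    For an in-use entry, `number` is the object's byte offset; for a free
    entry it is the object number of the next free object.
    """
    number, generation, kind = line.split()
    if len(number) != 10 or len(generation) != 5 or kind not in ("n", "f"):
        raise ValueError("malformed xref entry: %r" % line)
    return int(number), int(generation), kind == "n"


if __name__ == "__main__":
    for entry in ("0000000000 65535 f", "0000000015 00000 n"):
        print(parse_xref_entry(entry))
```

Read this way, the second entry says that object 1 starts at byte 15, which is consistent with a 9-byte `%PDF-1.3` header line followed by a 6-byte comment line.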
Let’s add a page with some content. First we add the contents stream to the document…
/F1 24 Tf
240 700 Td
(Pedefe Pedefeito Pedefeon!) Tj
ET''')
doc.add(contents)
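For reference, a contents stream is just a block of drawing operators wrapped in a stream object whose dictionary states its length in bytes. The sketch below builds one by hand, assuming the text-showing operators go between `BT` and `ET`; the `text_stream` function and its defaults are invented for this example and do not come from miniPDF.

```python
def text_stream(text, font="F1", size=24, x=240, y=700):
    """Return the serialized body of a stream object that shows `text`."""
    # Parentheses and backslashes are special inside a literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    ops = "BT\n/%s %d Tf\n%d %d Td\n(%s) Tj\nET" % (font, size, x, y, escaped)
    data = ops.encode("latin-1")
    # /Length counts only the bytes between "stream\n" and "\nendstream".
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


if __name__ == "__main__":
    print(text_stream("Pedefe Pedefeito Pedefeon!").decode("latin-1"))
```

`Tf` selects the font and size, `Td` moves to the text position, and `Tj` shows the string.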
And then the page, referencing the already added contents…
page.add("Contents", PDFRef(contents))
doc.add(page)
And link the page in the Pages dictionary …
pages.add("Count", PDFNum(1))
# add parent reference in page
page.add("Parent", PDFRef(pages))
But we have used a font named /F1 in the contents stream so we need to add it to the doc…
font.add("Name", PDFName("F1"))
font.add("Subtype", PDFName("Type1"))
font.add("BaseFont", PDFName("Helvetica"))
Mapped in the font-name map dictionary…
fontname.add("F1", font)
And finally link the font in the page resources dictionary…
resources.add("Font", fontname)
page.add("Resources", resources)
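The link between the `/F1` used in the contents stream and the font dictionary goes through Resources → Font → F1. A quick way to see why that chain matters is to check that every font selected with `Tf` is declared. The sketch below does this over plain Python dicts and a naive regular expression; both function names are hypothetical, and the regex would be fooled by operator-like text inside a string.

```python
import re


def fonts_used(content):
    """Return the font resource names selected with the Tf operator."""
    return set(re.findall(r"/(\S+)\s+[-+]?[\d.]+\s+Tf", content))


def missing_fonts(content, resources):
    """Return the fonts used in `content` but absent from `resources`."""
    declared = set(resources.get("Font", {}))
    return fonts_used(content) - declared


if __name__ == "__main__":
    content = "BT\n/F1 24 Tf\n240 700 Td\n(Pedefe Pedefeito Pedefeon!) Tj\nET"
    font = {"Subtype": "Type1", "Name": "F1", "BaseFont": "Helvetica"}
    resources = {"Font": {"F1": font}}
    print(missing_fonts(content, resources))  # empty: F1 is declared
    print(missing_fonts(content, {}))         # F1 is reported as missing
```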
That’s it. Let’s print it to stdout…
To try it…
python mkTXTPDF.py >test.pdf
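If you don’t want to trust a viewer, a rough sanity check on the generated file is easy to write: the file should start with the `%PDF-` header, end with `%%EOF`, and the number after `startxref` should be the byte offset of the `xref` keyword. This is only a sketch of such a check, with `check_pdf` as a made-up name; it says nothing about whether the objects themselves are valid.

```python
import re


def check_pdf(data):
    """Return True if `data` has a header and a startxref pointing at xref."""
    if not data.startswith(b"%PDF-"):
        return False
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if match is None:
        return False
    pos = int(match.group(1))
    return data[pos:pos + 4] == b"xref"


if __name__ == "__main__":
    import sys

    # Usage: python check.py test.pdf
    with open(sys.argv[1], "rb") as handle:
        print(check_pdf(handle.read()))
```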
You can download a test bundle here. And the generated pdf, here.
f/
Nice write up on PDF internals and the miniPDF library. As you say, there are now alternatives, but miniPDF was all I needed. Thanks!