miniPDF.py reinventing the wheel of PDF evilness

January 12, 2010


Last year I coded a mini PDF rendering library from scratch, mostly as a way to work through the whole PDF spec and learn something about it. Nowadays you’ll find other, probably better, options for managing PDFs programmatically, like pyPDF or PDF::Writer. Anyway, mine is here. (It was also presented at uCon in Feb 2009. Slides here.)

It supports only the most basic file structure (PDF32000:7.5.1), that is, without incremental updates or linearization:

• A one-line header identifying the version of the PDF specification to which the file conforms

• A body containing the objects that make up the document contained in the file

• A cross-reference table containing information about the indirect objects in the file

• A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
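To make the four parts concrete, here is a rough standalone sketch (Python 3.5+) that writes them out by hand. It is not miniPDF's code: `build_pdf` is a hypothetical helper, and treating the first object as the Catalog is just a convention assumed for this example.

```python
def build_pdf(objects):
    """Serialize already-encoded object bodies into a basic PDF file.

    objects[i] becomes indirect object number i+1, generation 0.
    Assumption for this sketch: the first object is the Catalog.
    """
    out = bytearray(b"%PDF-1.3\n")                    # 1. header
    offsets = []
    for num, body in enumerate(objects, start=1):     # 2. body
        offsets.append(len(out))                      # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)                               # 3. cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                   # entry 0: head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off              # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)  # 4. trailer
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Usage: a Catalog pointing at an empty Pages node.
pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
```

The offsets are recorded while the body is written, which is why the cross-reference table can only be emitted after all the objects.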

It also supports all the base PDF types: null, object references, strings, numbers, arrays and dictionaries.
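As an illustration of how those types look on disk, here is a small sketch that maps Python values to PDF syntax. The names `Name`, `Ref` and `to_pdf` are made up for this example and are not miniPDF's classes (those are `PDFName`, `PDFRef` and friends); string escaping is reduced to the bare minimum.

```python
class Name(str):
    """Hypothetical marker for a PDF name such as /Catalog."""


class Ref(object):
    """Hypothetical reference to an indirect object (number, generation)."""

    def __init__(self, num, gen=0):
        self.num = num
        self.gen = gen


def to_pdf(value):
    """Return the PDF syntax for a Python value (sketch, not exhaustive)."""
    if value is None:
        return "null"
    if isinstance(value, Ref):
        return "%d %d R" % (value.num, value.gen)
    if isinstance(value, Name):          # must be tested before plain str
        return "/" + value
    if isinstance(value, str):           # literal string, with minimal escaping
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, int):
        return "%d" % value
    if isinstance(value, float):         # PDF reals have no exponent notation
        return ("%f" % value).rstrip("0").rstrip(".")
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):          # keys are written as names
        items = " ".join("/%s %s" % (k, to_pdf(v)) for k, v in value.items())
        return "<< " + items + " >>"
    raise TypeError("unsupported type: %r" % type(value))


catalog = to_pdf({"Type": Name("Catalog"), "Pages": Ref(2)})
```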

A minimal text-displaying PDF will have this structure (BUG: the resources dictionary should be hanging from a page, I think)…

and will look like this in Python.

First we import the lib and create a PDFDoc object representing a document in memory …

from miniPDF import *
doc = PDFDoc()

As shown in the last figure the main object is the Catalog. The next 3 lines construct a Catalog Dictionary object, add it to the document and set it as the root object…

catalog = PDFDict({"Type": PDFName("Catalog")})
doc.add(catalog)
doc.setRoot(catalog)

And the Pages dictionary…

pages = PDFDict({"Type": PDFName("Pages")})
catalog.add("Pages", PDFRef(pages))
doc.add(pages)

At this point we don’t even have a valid PDF, but this is what the output will look like…


%PDF-1.3
%....
1 0 obj
<<<>> >>
endobj
xref
0 2
0000000000 65535 f
0000000015 00000 n
trailer
<>
startxref
78
%%EOF

Let’s add a page with some content. First we add the contents stream to the document…

contents = PDFStream('''BT
/F1 24 Tf
240 700 Td
(Pedefe Pedefeito Pedefeon!) Tj
ET''')
doc.add(contents)
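For reference, the operators in that stream are standard PDF text operators: BT and ET open and close a text object, Tf selects a font resource and size, Td moves the text position, and Tj shows a string. On disk a stream is a dictionary carrying at least /Length, followed by the raw bytes between the `stream` and `endstream` keywords. A minimal sketch of that wrapping (Python 3.5+; `make_stream` is a hypothetical helper, not how `PDFStream` is necessarily implemented):

```python
def make_stream(data):
    """Wrap raw stream bytes in PDF stream syntax with a matching /Length."""
    header = b"<< /Length %d >>\nstream\n" % len(data)
    return header + data + b"\nendstream"


content = (
    b"BT\n"                                  # begin text object
    b"/F1 24 Tf\n"                           # font resource F1 at 24 points
    b"240 700 Td\n"                          # move to x=240, y=700
    b"(Pedefe Pedefeito Pedefeon!) Tj\n"     # show the string
    b"ET"                                    # end text object
)
stream_obj = make_stream(content)
```

/Length counts only the bytes between the two keywords, so it has to be computed from the final (possibly compressed) data.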

And then the page, referencing the already added contents…

page = PDFDict({"Type": PDFName("Page")})
page.add("Contents", PDFRef(contents))
doc.add(page)

And link the page in the Pages dictionary …

pages.add("Kids", PDFArray([PDFRef(page)]))
pages.add("Count", PDFNum(1))
# add parent reference in page
page.add("Parent", PDFRef(pages))

But we have used a font named /F1 in the contents stream so we need to add it to the doc…

font = PDFDict()
font.add("Name", PDFName("F1"))
font.add("Subtype", PDFName("Type1"))
font.add("BaseFont", PDFName("Helvetica"))

Then we map it in the font-name dictionary…

fontname = PDFDict()
fontname.add("F1", font)

And finally link the font in the page resources dictionary…

resources = PDFDict()
resources.add("Font", fontname)
page.add("Resources", resources)

That’s it. Let’s print it to stdout:

print doc

To try it…

python mkTXTPDF.py >test.pdf
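If a viewer refuses to open the result, a quick structural check can help. The sketch below (Python 3; `find_xref` is a hypothetical helper, not part of miniPDF) verifies the header and confirms that the offset after `startxref` really points at the `xref` keyword. It assumes the basic, non-incremental layout described above.

```python
import re


def find_xref(data):
    """Return the offset stored after 'startxref', checking header and table."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header")
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if match is None:
        raise ValueError("no startxref/%%EOF at the end of the file")
    offset = int(match.group(1))
    if data[offset:offset + 4] != b"xref":
        raise ValueError("startxref does not point at the xref table")
    return offset


# Usage on the generated file:
#     with open("test.pdf", "rb") as f:
#         find_xref(f.read())
```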

You can download a test bundle here, and the generated PDF here.

f/

One Response to “miniPDF.py reinventing the wheel of PDF evilness”

  1. lazaruslair said

    Nice write up on PDF internals and the miniPDF library. As you say, there are now alternatives, but miniPDF was all I needed. Thanks!
