reinventing the wheel of PDF evilness

January 12, 2010

Last year I coded a mini PDF rendering library from scratch, mostly as a way to go through the whole PDF spec and learn something about it. Nowadays you’ll find other, probably better, options for managing PDFs programmatically, like pyPDF or PDF::Writer. Anyway, mine is here. (It was also presented at uCon in Feb 2009. Slides here.)

It supports only the most basic file structure (PDF32000:7.5.1), that is, without incremental updates or linearization:

• A one-line header identifying the version of the PDF specification to which the file conforms

• A body containing the objects that make up the document contained in the file

• A cross-reference table containing information about the indirect objects in the file

• A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
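To make those four parts concrete, here is a rough standalone sketch in Python 3 that writes them by hand. It does not use miniPDF at all: the function name `build_minimal_pdf`, the object numbering and the US Letter MediaBox are my own assumptions for illustration, and the text is assumed to contain no parentheses or backslashes.

```python
def build_minimal_pdf(text="Hello PDF"):
    """Return the bytes of a one-page PDF showing `text` in Helvetica.

    Hypothetical helper for illustration only; not part of miniPDF.
    Assumes `text` has no parentheses or backslashes (no escaping is done).
    """
    stream = ("BT\n/F1 24 Tf\n100 700 Td\n(%s) Tj\nET" % text).encode("latin-1")

    # Object bodies, numbered 1..5 in list order; "N 0 R" is a reference.
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    # 1. Header: a single line naming the PDF version.
    out = bytearray(b"%PDF-1.4\n")

    # 2. Body: each indirect object, remembering its byte offset.
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    # 3. Cross-reference table: one 20-byte entry per object,
    #    starting with the free entry for object 0.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off

    # 4. Trailer: table size, root object, and where the table starts.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result with `open("test.pdf", "wb").write(build_minimal_pdf())` should give a file a PDF reader can open. Note that in this sketch the resources dictionary hangs from the page object.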

It also supports all the base PDF types: null, object references, strings, numbers, arrays and dictionaries.
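As a quick reminder of how those types look on disk, here is a small standalone sketch that serializes Python values into PDF syntax. The `Name` and `Ref` classes and the `serialize` function are hypothetical names of mine, not miniPDF's API, which may well do this differently. I also handle names and booleans, since dictionary keys are names and Python's `bool` would otherwise be treated as a number.

```python
class Name(str):
    """Marks a string as a PDF name, written as /Name (hypothetical helper)."""


class Ref(object):
    """An indirect reference to object `num`, generation `gen` (hypothetical)."""

    def __init__(self, num, gen=0):
        self.num, self.gen = num, gen


def serialize(value):
    """Return the PDF syntax for a Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # must come before the int check
        return "true" if value else "false"
    if isinstance(value, Ref):
        return "%d %d R" % (value.num, value.gen)
    if isinstance(value, Name):          # must come before the str check
        return "/" + value
    if isinstance(value, str):
        # Literal strings are wrapped in parentheses; escape the specials.
        escaped = (value.replace("\\", "\\\\")
                        .replace("(", "\\(")
                        .replace(")", "\\)"))
        return "(" + escaped + ")"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # PDF has no exponent notation, so avoid repr() for floats.
        return ("%f" % value).rstrip("0").rstrip(".")
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(serialize(v) for v in value) + "]"
    if isinstance(value, dict):
        items = " ".join("/%s %s" % (k, serialize(v)) for k, v in value.items())
        return "<< " + items + " >>"
    raise TypeError("cannot serialize %r" % (value,))
```

For example, `serialize({"Type": Name("Catalog"), "Pages": Ref(2)})` gives `<< /Type /Catalog /Pages 2 0 R >>` (on Python 3.7+, where dictionaries keep insertion order).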

A minimal text-displaying PDF will have this structure (BUG: I think the resources dictionary should be hanging from a page)…

and will look like this in Python.

First we import the lib and create a PDFDoc object representing a document in memory …

from miniPDF import *
doc = PDFDoc()

As shown in the last figure the main object is the Catalog. The next 3 lines construct a Catalog Dictionary object, add it to the document and set it as the root object…

catalog = PDFDict({"Type": PDFName("Catalog")})

And the Pages dictionary..

pages = PDFDict({"Type": PDFName("Pages")})
catalog.add("Pages", PDFRef(pages))

At this point we don’t even have a valid PDF, but this is what the output looks like so far …

1 0 obj
<<<>> >>
0 2
0000000000 65535 f
0000000015 00000 n
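The last three lines of that output are the cross-reference table: a subsection header (first object number, then number of entries) followed by one fixed-width entry per object. Here is a standalone sketch of how such entries can be formatted. `xref_entry` and `xref_table` are hypothetical names, not miniPDF functions.

```python
def xref_entry(offset, generation=0, in_use=True):
    """Format one cross-reference entry (hypothetical helper).

    Each entry is exactly 20 bytes: a 10-digit byte offset, a 5-digit
    generation number, 'n' (in use) or 'f' (free), then a 2-byte line end.
    """
    return "%010d %05d %s \n" % (offset, generation, "n" if in_use else "f")


def xref_table(offsets):
    """Build an xref section for objects 1..len(offsets), in order."""
    lines = [
        "xref\n",
        "0 %d\n" % (len(offsets) + 1),          # first object number, entry count
        xref_entry(0, 65535, in_use=False),     # object 0 is always the free-list head
    ]
    lines += [xref_entry(off) for off in offsets]
    return "".join(lines)
```

Calling `xref_table([15])` produces the `0 2` header and the same two entries as the output above: the free entry for object 0, then object 1 at byte offset 15.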

Let’s add a page with some content. First we add the contents stream to the document…

/F1 24 Tf
240 700 Td
(Pedefe Pedefeito Pedefeon!) Tj
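Those three operators select the font, move to a position and show a string. Here is a standalone sketch of how such a stream could be generated and wrapped in a stream object. It is not miniPDF code and every function name is hypothetical. One assumption to flag: the sketch wraps the operators in `BT` … `ET`, because the PDF spec requires text operators to sit inside a text object, and the excerpt above does not show how the library handles that.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_content(text, font="F1", size=24, x=240, y=700):
    """Return content-stream operators that show `text` (hypothetical helper)."""
    ops = [
        "BT",                                  # begin text object
        "/%s %d Tf" % (font, size),            # select font resource and size
        "%d %d Td" % (x, y),                   # move to the text position
        "(%s) Tj" % escape_pdf_string(text),   # show the string
        "ET",                                  # end text object
    ]
    return "\n".join(ops)


def stream_object(data):
    """Wrap content in a PDF stream with a matching /Length entry."""
    raw = data.encode("latin-1")
    return b"<< /Length %d >>\nstream\n" % len(raw) + raw + b"\nendstream"
```

The `/Length` value counts only the bytes between `stream` and `endstream`, which is why it is computed from the encoded data rather than the Python string.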

And then the page, referencing the already added contents…

page = PDFDict({"Type": PDFName("Page")})
page.add("Contents", PDFRef(contents))

And link the page in the Pages dictionary …

pages.add("Kids", PDFArray([PDFRef(page)]))
pages.add("Count", PDFNum(1))
#add parent reference in page

But we have used a font named /F1 in the contents stream so we need to add it to the doc…

font = PDFDict()
font.add("Name", PDFName("F1"))
font.add("Subtype", PDFName("Type1"))
font.add("BaseFont", PDFName("Helvetica"))

Mapped in the font-name map dictionary…

fontname = PDFDict()

And finally link the font in the page resources dictionary…

resources = PDFDict()

That’s it. Let’s print it to stdout

print doc

To try it..

python >test.pdf

You can download a test bundle here. And the generated pdf, here.



One Response to “ reinventing the wheel of PDF evilness”

  1. lazaruslair said

    Nice write up on PDF internals and the miniPDF library. As you say, there are now alternatives, but miniPDF was all I needed. Thanks!
