miniPDF.py reinventing the wheel of PDF evilness

January 12, 2010

Last year I coded a mini PDF rendering library from scratch, mostly as a way to go through the whole PDF spec and learn something about it. Nowadays you’ll find other, probably better, options for managing PDFs programmatically, like pyPDF or PDF::Writer. Anyway, mine is here. (It was also presented at uCon in Feb 2009. Slides here.)

It supports only the most basic file structure (PDF 32000, section 7.5.1), that is, without incremental updates or linearization:

• A one-line header identifying the version of the PDF specification to which the file conforms

• A body containing the objects that make up the document contained in the file

• A cross-reference table containing information about the indirect objects in the file

• A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
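
To make the four parts concrete, here is a minimal sketch in plain Python 3 that does not use miniPDF at all. The helper name `build_pdf` is made up for this illustration, and the object bodies are passed in already encoded. It writes the header, a body of numbered objects, a cross-reference table holding the byte offset of each object, and a trailer that points back at the table.

```python
def build_pdf(objects, root=1):
    """Serialize already-encoded object bodies into the basic PDF file layout.

    `objects` is a list of bytes; element i becomes indirect object i+1.
    Hypothetical helper for illustration only, not part of miniPDF.
    """
    out = bytearray(b"%PDF-1.3\n")                 # header
    offsets = []
    for num, body in enumerate(objects, start=1):  # body
        offsets.append(len(out))                   # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)    # cross-reference table
    out += b"0000000000 65535 f \n"                # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off           # every entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (len(objects) + 1, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos  # where the xref table starts
    return bytes(out)


pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
```

The only delicate part is bookkeeping: the offsets in the table must be the exact byte positions of the objects, which is why everything is handled as bytes rather than text.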

It also supports all the base PDF types: null, object references, strings, numbers, arrays and dictionaries.
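
As a rough idea of what serializing these types involves, the following sketch maps plain Python values to PDF syntax. The names `Name`, `Ref` and `to_pdf` are assumptions made for this example; they are not miniPDF's classes (miniPDF uses `PDFName`, `PDFRef`, `PDFDict` and so on, as shown further down), and the sketch ignores hex strings and name escaping.

```python
class Name(str):
    """A PDF name such as /Type (hypothetical helper, not miniPDF's PDFName)."""


class Ref(int):
    """Reference to indirect object N, generation 0 (hypothetical helper)."""


def to_pdf(value):
    """Return the PDF syntax for a Python value, as text."""
    if value is None:
        return "null"
    if isinstance(value, Ref):            # checked before int: Ref subclasses int
        return "%d 0 R" % value
    if isinstance(value, Name):           # checked before str: Name subclasses str
        return "/" + value
    if isinstance(value, bool):           # checked before int: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        # repr() is fine for simple values; floats that print in exponent
        # notation (e.g. 1e-07) would not be valid PDF numbers.
        return repr(value)
    if isinstance(value, str):
        # Literal strings need backslash and parentheses escaped.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        items = " ".join("/%s %s" % (k, to_pdf(v)) for k, v in value.items())
        return "<< " + items + " >>"
    raise TypeError("unsupported type: %r" % type(value))


catalog_text = to_pdf({"Type": Name("Catalog"), "Pages": Ref(2)})
```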

A minimal text-displaying PDF will have this structure (BUG: the resources dictionary should be hanging from a page, I think)…

and will look like this in Python.

First we import the lib and create a PDFDoc object representing a document in memory …

from miniPDF import *
doc = PDFDoc()

As shown in the last figure the main object is the Catalog. The next 3 lines construct a Catalog Dictionary object, add it to the document and set it as the root object…

catalog = PDFDict({"Type": PDFName("Catalog")})
doc.add(catalog)
doc.setRoot(catalog)

And the Pages dictionary..

pages = PDFDict({"Type": PDFName("Pages")})
catalog.add("Pages", PDFRef(pages))
doc.add(pages)

At this point we don’t even have a valid PDF, but this is what the output will look like…

%PDF-1.3
%....
1 0 obj
<<<>> >>
endobj
xref
0 2
0000000000 65535 f
0000000015 00000 n
trailer
<>
startxref
78
%%EOF

Let’s add a page with some content. First we add the contents stream to the document…

contents = PDFStream('''BT
/F1 24 Tf
240 700 Td
(Pedefe Pedefeito Pedefeon!) Tj
ET''')
doc.add(contents)
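
For reference, a stream object in the file is a dictionary followed by the raw data between the `stream` and `endstream` keywords, with `/Length` giving the number of data bytes. The sketch below shows that framing in plain Python 3; `stream_object` is a made-up helper, and how miniPDF's `PDFStream` does this internally is not shown in this post.

```python
# Content stream operators used by the example page:
#   BT / ET   begin / end a text object
#   Tf        select font resource /F1 at size 24
#   Td        move the text position to (240, 700)
#   Tj        show the string
content = b"""BT
/F1 24 Tf
240 700 Td
(Pedefe Pedefeito Pedefeon!) Tj
ET"""


def stream_object(data):
    """Wrap raw bytes as a PDF stream body (hypothetical helper).

    /Length counts only the data bytes; the newline written before
    `endstream` is not part of the stream.
    """
    return (b"<< /Length %d >>\nstream\n" % len(data)) + data + b"\nendstream"


body = stream_object(content)
```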

And then the page, referencing the already added contents…

page = PDFDict({"Type": PDFName("Page")})
page.add("Contents", PDFRef(contents))
doc.add(page)

And link the page in the Pages dictionary …

pages.add("Kids", PDFArray([PDFRef(page)]))
pages.add("Count", PDFNum(1))
# add parent reference in page
page.add("Parent", PDFRef(pages))
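
The three entries have to agree with each other: `Count` is the number of leaf pages below the node, and every kid's `Parent` must point back at the node that lists it. A small sketch of that invariant, using plain Python dicts as stand-ins for the PDF objects (the function name `check_page_tree` is an assumption for this example, not a miniPDF call):

```python
def check_page_tree(node, parent=None):
    """Return the number of leaf pages under `node`, verifying Parent and Count.

    Nodes are plain dicts here; in a real file they would be indirect objects.
    """
    if node.get("Parent") is not parent:
        raise ValueError("Parent does not point back to the containing node")
    if node["Type"] == "Page":
        return 1
    total = sum(check_page_tree(kid, node) for kid in node["Kids"])
    if node["Count"] != total:
        raise ValueError("Count is %r but the subtree has %d pages"
                         % (node["Count"], total))
    return total


# Same shape as the document built in this post: one Pages node, one Page.
pages_node = {"Type": "Pages", "Kids": [], "Count": 1}
page_node = {"Type": "Page", "Parent": pages_node}
pages_node["Kids"].append(page_node)
n_pages = check_page_tree(pages_node)
```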

But we have used a font named /F1 in the contents stream so we need to add it to the doc…

font = PDFDict()
font.add("Name", PDFName("F1"))
font.add("Subtype", PDFName("Type1"))
font.add("BaseFont", PDFName("Helvetica"))

Mapped in the font-name map dictionary…

fontname = PDFDict()
fontname.add("F1", font)

And finally link the font in the page resources dictionary…

resources = PDFDict()
resources.add("Font", fontname)
page.add("Resources", resources)
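
The name after the slash in `/F1 24 Tf` is only a key into the page's `Font` resource map, so the two have to match. The following sketch checks that link with plain Python 3 data; the function names are hypothetical, and the regular expression is deliberately naive (it would also match text inside a string operand).

```python
import re

content = "BT\n/F1 24 Tf\n240 700 Td\n(Pedefe Pedefeito Pedefeon!) Tj\nET"
resources = {
    "Font": {"F1": {"Name": "F1", "Subtype": "Type1", "BaseFont": "Helvetica"}},
}


def fonts_used(content):
    """Return the resource names selected by Tf operators in a content stream."""
    return re.findall(r"/(\S+)\s+[\d.]+\s+Tf", content)


def missing_fonts(content, resources):
    """Return names used by Tf that the resources dictionary does not define."""
    defined = resources.get("Font", {})
    return [name for name in fonts_used(content) if name not in defined]


unresolved = missing_fonts(content, resources)
```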

That’s it. Let’s print it to stdout…

print doc

To try it..
python mkTXTPDF.py >test.pdf
You can download a test bundle here. And the generated pdf, here.
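
If you want a quick sanity check of a generated file without opening a viewer, something like the sketch below can help. It is plain Python 3, independent of miniPDF, and `check_xref` is a made-up name. It assumes the basic layout described above: a single xref section with one subsection, newline-terminated lines, and no incremental updates.

```python
import re


def check_xref(data):
    """Return object numbers whose xref offset does not point at 'N G obj'."""
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    if match is None:
        raise ValueError("no startxref / %%EOF found")
    start = int(match.group(1))
    lines = data[start:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")
    first, count = map(int, lines[1].split())
    bad = []
    for num, entry in enumerate(lines[2:2 + count], start=first):
        offset, gen, kind = entry.split()
        if kind != b"n":
            continue                      # free entries have nothing to check
        expected = b"%d %d obj" % (num, int(gen))
        if not data[int(offset):].startswith(expected):
            bad.append(num)
    return bad
```

For example, `check_xref(open("test.pdf", "rb").read())` should return an empty list when every in-use entry points at the start of its object.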

Posted by feliam

Filed in pdf, security ·Tags: pdf, python, security

One Response to “miniPDF.py reinventing the wheel of PDF evilness”

lazaruslair said
February 15, 2012 at 5:39 pm
Nice write up on PDF internals and the miniPDF library. As you say, there are now alternatives, but miniPDF was all I needed. Thanks!

Feliam's Blog