August 23, 2010

It’s an Open PDF Analysis Framework!

A PDF file relies on a complex file structure built from a set of tokens and grammar rules, and each token may be compressed, encrypted or even obfuscated. Open PDF Analysis Framework will understand, decompress and de-obfuscate these basic PDF elements and present the resulting soup as a clean XML tree (done!). From there the idea is to compile a set of rules that can be used to decide what to keep, what to cut out and ultimately whether it is safe to open the resulting PDF projection (todo!).
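To give a feel for what "decompress" means here: many PDF streams are compressed with the FlateDecode filter, which is zlib-wrapped deflate. The snippet below is only a minimal sketch using Python's standard library, not OPAF's actual code; the `flate_decode` helper and the sample stream are made up for illustration.

```python
import zlib

# A made-up content stream, compressed the way a /FlateDecode filter would do it.
raw = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(raw)

def flate_decode(data):
    """Undo a /FlateDecode filter (zlib-wrapped deflate). Hypothetical helper."""
    return zlib.decompress(data)

decoded = flate_decode(compressed)
print(decoded)
```

A real file can chain several filters on one stream, so a decoder along these lines would have to apply them one after another.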

It’s written in Python using the PLY parser generator. The project page is here and you can get the code from here:

svn checkout http://opaf.googlecode.com/svn/trunk/ opaf-read-only

Keep reading for a test run…

Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here. Besides the straightforward natural parsing algorithm, the lib also tries a brute-force algorithm based on just a few tokens. Let’s take a look at what it can already do…
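What could a brute-force pass based on just a few tokens look like? One possibility, sketched below, is to ignore the cross-reference table completely and scan the raw bytes for anything shaped like "N G obj … endobj". This is an assumption about the general idea, not a copy of what OPAF does; `OBJ_RE`, `brute_force_objects` and the sample bytes are invented for the example.

```python
import re

# Matches "<number> <generation> obj ... endobj" anywhere in the file.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)

def brute_force_objects(data):
    """Return {(number, generation): body} for every object-looking span."""
    found = {}
    for match in OBJ_RE.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # A later definition of the same object replaces an earlier one.
        found[key] = match.group(3).strip()
    return found

sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"some garbage in between\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)
objects = brute_force_objects(sample)
print(sorted(objects))
```

A scan this naive gets fooled by the word endobj appearing inside a stream, which is one reason to try the grammar-based parser first and keep this as a fallback.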

Well, you first need a shady pdf like this one. This is not some alien PDF and there’s nothing really malicious about it. It even looks plain…

… but if you try to open it with a text/hex editor it stops being so friendly…

Here is where you get to try the OPAF! thing. Get the code and the pdf, solve the dependencies and run it like this..

python opaf.py textg.pdf

It will generate a graph like the following for your amusement..

That shows the minimalistic logical structure of this PDF. Note that you may get really big graphs here with other pdf samples. I have tried up to 3k nodes. That’s fun! But sadly not very useful. But that’s not all! It also gets you an XML representation of the pdf. This XML will look like this…

After this step, well, you can pretty much bring every known XML technology into the game, XPath being the most notable one when searching for specific things. In the project (the small, young, unfinished, work-in-progress, not really well coded project) there are some examples of what you can do once you have the pdf in its XML form. Use it, ignore it, patch it (lots of basic things still to be done). It’s open source!!! f/
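As a taste of the XPath part, here is a small sketch using the standard library's ElementTree, which understands a limited subset of XPath. Only the `name` and `R` elements with their `payload` attribute are taken from the example in the comments below; the `pdf` and `dictionary` wrapper elements and the second key are assumptions made up for this snippet, so the real OPAF output may well look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the wrapper elements are invented for this sketch.
xml_text = """
<pdf>
  <dictionary>
    <name payload="Root"/> <R payload="(2, 0)"/>
    <name payload="Info"/> <R payload="(5, 0)"/>
  </dictionary>
</pdf>
"""
root = ET.fromstring(xml_text)

# Every reference anywhere in the tree.
refs = [ref.get("payload") for ref in root.findall(".//R")]

# Only the dictionary keys called Root.
root_keys = root.findall(".//name[@payload='Root']")

print(refs, len(root_keys))
```

The same path expressions should also work with lxml, which additionally supports full XPath 1.0 if you need more than this subset.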

Made a snapshot for you, download it here. Also in the news, the main tool now accepts some basic arguments…

/opaf $ python opaf.py  --help
Usage: opaf.py [options]

  -h, --help            show this help message and exit
  -x XML, --xmlfile=XML
                        Generate an xml file.
  -l LOG, --logfile=LOG
                        Dump log messages to LOG file.
  -i, --interactive     Throw interactive python shell
  -g GRAPH, --graph=GRAPH
                        Generate and dump graph to GRAPH.
  -d, --decompress      Apply a filter pack to decompress and parse objec

Also check out the next post for an example use: Taking statistics on pdf fuzzing databases with OPAF! http://wp.me/pLJYx-7Q


3 Responses to “Opaf!”

  1. alu said

    I just noticed that for some distros (namely archlinux) it’s necessary to have lxml as a dependency as well

  2. Leonard Rosenthol said

    I wonder how the parser and XML generator handle “recursive” scenarios such as (page->annots->parent->page…)

  3. feliam said

    The XML has its own version of a PDFReference. Check out the following example, this dictionary…

    << /Root 2 0 R >>

    will be translated into its XML version ..

    <name payload="Root"/> <R payload="(2, 0)"/>

    It just points to the reference, so if there is a loop it’s not really a problem.
    All this is at an early stage and I’m not much of an XML guy, but hopefully the framework will have an XML schema/DTD and all. We’ll see..
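To illustrate the point about loops: because a reference is stored as a plain pointer like (2, 0), nothing recurses until somebody decides to follow it, and whoever follows it only has to remember where they have already been. The sketch below assumes a made-up object table with reference payloads written the way the XML attribute above shows them; neither the `objects` table nor the `reachable` function is part of OPAF.

```python
import ast

# Hypothetical object table: (number, generation) -> reference payloads found
# inside that object, in the same "(2, 0)" text form as the XML attribute.
objects = {
    (1, 0): ["(2, 0)"],  # page -> annotation
    (2, 0): ["(1, 0)"],  # annotation -> parent page, which closes the loop
}

def reachable(start):
    """Follow references from start, visiting every object only once."""
    seen = set()
    stack = [start]
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue  # already visited, so the loop ends here
        seen.add(ref)
        for payload in objects.get(ref, []):
            stack.append(ast.literal_eval(payload))  # "(2, 0)" -> (2, 0)
    return seen

visited = reachable((1, 0))
print(sorted(visited))
```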
