Opaf!
August 23, 2010
It’s an Open PDF Analysis Framework!
It's written in Python using the PLY parser generator. The project page is here and you can get the code from here:
Keep reading for a test run…
Most of the work OPAF! hides from you is outlined in our earlier posts about scanning a PDF and parsing a PDF, and in the one discussing the caveats in the actual PDF ISO standard here. Besides the straightforward natural parsing algorithm, the lib also tries a brute-force algorithm based on just a few tokens. Let's take a look at what it can already do…
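To give a feel for what a brute-force pass "based on just a few tokens" could look like, here is a minimal sketch. It is not OPAF!'s actual algorithm: it simply assumes that scanning for the `obj` and `endobj` keywords is enough to recover object bodies, and the names `OBJ_RE`, `scan_objects` and `SAMPLE` are made up for this illustration.

```python
import re

# Assumption: an indirect object looks like "<num> <gen> obj ... endobj".
# This ignores xref tables, streams containing "endobj", and other corner cases.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)


def scan_objects(data):
    """Return a list of (object number, generation, raw body) tuples."""
    found = []
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        generation = int(match.group(2))
        body = match.group(3).strip()
        found.append((number, generation, body))
    return found


# Hypothetical, deliberately broken input: junk before the first object, no xref.
SAMPLE = (
    b"%PDF-1.4\ngarbage 1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)

if __name__ == "__main__":
    for number, generation, body in scan_objects(SAMPLE):
        print(number, generation, body)
```

A scan like this does not care whether the cross-reference table is present or sane, which is probably the appeal of a token-based fallback when the natural parse fails.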
Well, first you need a shady PDF like this one. It's not some alien PDF and there's nothing really malicious about it. It even looks plain…
… but if you try to open it with a text/hex editor it stops being so friendly…
Here is where you get to try the OPAF! thing. Get the code and the PDF, resolve the dependencies and run it like this:
It will generate a graph like the following for your amusement:
That shows the minimal logical structure of this PDF. Note that you may get really big graphs with other PDF samples; I have tried up to 3k nodes. That's fun, but sadly not very useful. But that's not all! It also gets you an XML representation of the PDF. This XML will look like this…
After this step you can pretty much bring every known XML technology into the game, XPath being the most notable one when searching for specific things. The project (small, young, unfinished, flagged as work in progress and not really well coded) includes some examples of what you can do once you have the PDF in its XML form. Use it, ignore it, patch it (lots of basic things are yet to be done). It's open source!!! f/
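As a rough idea of the kind of XPath search meant here, the sketch below uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The XML is a hand-written fragment modeled on the dictionary example quoted in the comments further down; the `<pdf>` wrapper, the `<number>` element and the helper names are assumptions made for illustration, not OPAF!'s real output or API.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <dictionary>, <dictionary_entry>, <name> and <R>
# come from the example in the comments; everything else is assumed.
SAMPLE_XML = """
<pdf>
  <dictionary>
    <dictionary_entry>
      <name payload="Root"/>
      <R payload="(2, 0)"/>
    </dictionary_entry>
    <dictionary_entry>
      <name payload="Size"/>
      <number payload="7"/>
    </dictionary_entry>
  </dictionary>
</pdf>
"""


def find_references(root):
    """Return (key, reference payload) for every entry whose value is a reference."""
    results = []
    # "[R]" keeps only the dictionary_entry elements that have an <R> child.
    for entry in root.findall(".//dictionary_entry[R]"):
        key = entry.find("name")
        ref = entry.find("R")
        if key is not None:
            results.append((key.get("payload"), ref.get("payload")))
    return results


def names_with_payload(root, key):
    """Return every <name> element whose payload attribute equals key."""
    return root.findall(".//name[@payload='%s']" % key)


if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE_XML.strip())
    print(find_references(tree_root))
    print(len(names_with_payload(tree_root, "Root")))
```

For heavier queries a full XPath engine such as the one in lxml would be the natural step up; the standard library is used here only to keep the sketch dependency-free.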
–update–
Made a snapshot for you, download it here. Also in the news, the main tool now accepts some basic arguments…
/opaf $ python opaf.py --help
Usage: opaf.py [options]

Options:
  -h, --help                 show this help message and exit
  -x XML, --xmlfile=XML      Generate an xml file.
  -l LOG, --logfile=LOG      Dump log messages to LOG file.
  -i, --interactive          Throw interactive python shell
  -g GRAPH, --graph=GRAPH    Generate and dump graph to GRAPH.
  -d, --decompress           Apply a filter pack to decompress and parse objec streams.
Also check out the next post for an example of use: Taking statistics on pdf fuzzing databases with OPAF! http://wp.me/pLJYx-7Q
Hi,
I just noticed that on some distros (namely Arch Linux) it's necessary to have lxml as a dependency as well.
I wonder how the parser and XML generator handle “recursive” scenarios such as (page->annots->parent->page…)
The XML has its own version of a PDFReference. Check out the following example. This dictionary…
<< /Root 2 0 R >>
will be translated into its XML version:
<dictionary>
  <dictionary_entry>
    <name payload="Root"/> <R payload="(2, 0)"/>
  </dictionary_entry>
</dictionary>
It just points to the reference, so if there is a loop it's not really a problem.
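If you do want to follow those references yourself, the usual trick is to remember which objects you have already visited. The sketch below is only an illustration under assumptions: the `OBJECTS` table is a made-up stand-in for whatever you extract from the XML, and it assumes the `payload` strings always look like `"(2, 0)"`.

```python
import ast

# Hypothetical object table: (number, generation) -> payloads of the
# references found inside that object. It mirrors page->annots->parent->page.
OBJECTS = {
    (1, 0): ["(2, 0)"],  # page points to its annots array
    (2, 0): ["(3, 0)"],  # annots array points to an annotation
    (3, 0): ["(1, 0)"],  # annotation's parent points back to the page
}


def reachable(start, objects):
    """Return every object reachable from start, visiting each one only once."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        ref = stack.pop()
        if ref in seen:
            continue  # already visited: this is what breaks the loop
        seen.add(ref)
        order.append(ref)
        for payload in objects.get(ref, []):
            # Turn the "(2, 0)" string into the tuple (2, 0).
            stack.append(ast.literal_eval(payload))
    return order


if __name__ == "__main__":
    print(reachable((1, 0), OBJECTS))
```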
All this is at an early stage and I'm not much of an XML guy, but hopefully the framework will get an XML schema/DTD and all. We'll see..