PDF stats

August 26, 2010


This is an example use of opaflib. The script described here uses opaflib to gather some statistics about the different PDF objects that appear in your file stash. These two charts show how often each Filter and each Object type appears in a small 10 MB database of randomly selected PDFs.

For a fuzzing base it is better if these numbers look even; otherwise you'll be testing the same thing over and over.

Keep reading for more exciting details!!! WEEEEEEEEEEEE!

OK, let's go step by step through the Python script that does this. First, import the OPAF! (beta) library. (Get it / contribute here.)

import sys  # needed for sys.stdin below
from opaflib import *

Initialize the counters…

types,filters = {} , {}
bytes, iobjects, streams, fstreams, cobjects = [], [], [], [], []

Read the PDF from stdin and parse it using the normal parser from OPAF!…

pdf = sys.stdin.read()  # keep the raw bytes; their length is reported at the end
xml_pdf = normalParser(pdf)

Find, expand and parse every ObjStm …

for objstm in xml_pdf.xpath(

WOAH! What’s that??
That’s XPath; for some reason XML people love it. It applies conditions to the XML nodes and ends up selecting exactly the set of nodes you want. Let’s dissect the XPath for getting all streams with compressed objects.
We need any appearing object stream …


… that has a dictionary ….


… with at least one dictionary entry…


… with a key named “Type” …

   /name[@payload=enc("Type") ...]

… which is followed by a value ObjStm.


That way we select exactly any ‘ObjStm’ value of a key named ‘Type’ inside a dictionary entry in a dictionary in an indirect object. We backtrack a little to select the indirect object stream itself…


and that’s it. 🙂 The XPath thing.
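
To make the dissection concrete, here is a minimal sketch that walks the same steps over a tiny hand-written tree. It is not opaflib output: the tag names are borrowed from the expressions in this post, the payloads are plain text rather than whatever enc() produces, and it uses the standard library's xml.etree.ElementTree as a stand-in. ElementTree's XPath subset has no sibling axis, so the "followed by" and "backtrack" steps are done in plain Python here instead of inside one expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tree. Tag names follow this post; payloads are
# plain strings here, which is an assumption (the post wraps them in enc()).
SAMPLE = """
<pdf>
  <indirect_object_stream id="1">
    <dictionary>
      <dictionary_entry><name payload="Type"/><name payload="ObjStm"/></dictionary_entry>
      <dictionary_entry><name payload="Filter"/><name payload="FlateDecode"/></dictionary_entry>
    </dictionary>
  </indirect_object_stream>
  <indirect_object_stream id="2">
    <dictionary>
      <dictionary_entry><name payload="Type"/><name payload="XObject"/></dictionary_entry>
    </dictionary>
  </indirect_object_stream>
</pdf>
"""

root = ET.fromstring(SAMPLE)

# Any object stream ...
streams = root.findall(".//indirect_object_stream")
# ... that has a dictionary with at least one dictionary entry ...
entries = root.findall(".//indirect_object_stream/dictionary/dictionary_entry")
# ... with a key named "Type".
type_keys = root.findall(
    ".//indirect_object_stream/dictionary/dictionary_entry/name[@payload='Type']"
)


def is_objstm(stream):
    """True if the stream's dictionary maps the key Type to the value ObjStm."""
    for entry in stream.findall("dictionary/dictionary_entry"):
        if [child.get("payload") for child in entry] == ["Type", "ObjStm"]:
            return True
    return False


# "Backtrack" to the stream itself: keep the streams whose entry matched.
objstms = [s for s in streams if is_objstm(s)]

print("streams: %d, entries: %d, Type keys: %d"
      % (len(streams), len(entries), len(type_keys)))
print("ObjStm stream ids: %r" % [s.get("id") for s in objstms])
```

Each findall call adds one more step to the path, so you can watch the selection narrow down before the final filter picks out the object streams.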

Now it counts the filters in use, looking for every PDF stream that has a /Filter key…

for xml_fi in xml_pdf.xpath('//indirect_object_stream/dictionary/'+
    if xml_fi.tag == 'array':
        fis = [payload(x) for x in xml_fi]
    elif xml_fi.tag == 'name':
        fis = [payload(xml_fi)]
    else:
        fis = []
    for fi in fis:
        filters[fi] = filters.get(fi,0)+1
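
A stream's /Filter value can be a single name or an array of names, which is why the loop turns both cases into a list before counting. Here is a standalone sketch of that tally on hand-made values, with no opaflib involved. tally_filters and the sample data are made up for illustration, and collections.Counter stands in for the dict.get(fi, 0) + 1 idiom.

```python
from collections import Counter


def tally_filters(filter_values):
    """Count filter names; each value is one name (str), a list of names, or None."""
    counts = Counter()
    for value in filter_values:
        if isinstance(value, list):    # array of filters on one stream
            names = value
        elif isinstance(value, str):   # single filter name
            names = [value]
        else:                          # no usable /Filter value
            names = []
        counts.update(names)
    return counts


# Hypothetical /Filter values taken from four streams.
sample = ["FlateDecode", ["ASCII85Decode", "FlateDecode"], "DCTDecode", None]
filters = tally_filters(sample)
print("Object Filter frequencies: %r" % dict(filters))
```

Note that a stream with two filters bumps two counters, so the filter counts can add up to more than the number of filtered streams.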

Count every different object type in the file, that is, every object that has a ‘Type’ key in its dictionary…

for xml_ty in xml_pdf.xpath('//dictionary/dictionary_entry'+
    ty = payload(xml_ty)
    types[ty] = types.get(ty,0)+1

Count all indirect objects…

iobjects = xml_pdf.xpath('//indirect_object')

All streams in the file…

streams = xml_pdf.xpath('//indirect_object_stream')

All filtered streams…

fstreams = xml_pdf.xpath('//indirect_object_stream'+

And all objects that were previously compressed and are now children of a root-level stream.


And finally print statistics to stdout.

print "Total number of parsed bytes: %s"%len(pdf)
print "Total number of indirect objects: %s"%len(iobjects)
print "Total number of streams: %s"%len(streams)
print "Total number of filtered streams: %s"%len(fstreams)
print "Total number of compressed objects: %s"%len(cobjects)
print "Object Filter frequencies: %s"%repr(filters)
print "Object Type frequencies: %s"%repr(types)

A version of this script is here (you need OPAF!). You run it like this…

python stats.py file1.pdf

… and it should give you something like the following if parsing was ok and all the other beta stuff went ok too.

Total number of parsed files: 82
Total number of parsed bytes: 100452601 [avg:1225031.71951]
Total number of indirect objects: 55928 [avg:682.048780488]
Total number of streams: 11726 [avg:143.0]
Total number of filtered streams: 10382 [avg:126.609756098]
Total number of compressed objects: 9093 [avg:110.890243902]
Object Filter frequencies:
{'A85': 1,
  'ASCII85Decode': 163,
  'CCITTFaxDecode': 128,
  'JBIG2Decode': 2,
  'LZWDecode': 559,
  'FlateDecode': 9075,
  'DCTDecode': 608,
  'JPXDecode': 3}

Object Type frequencies:
{'XObject': 2699,
  'Group': 45,
  'Pattern': 3,
  'PropertyList': 1,
  'OCG': 12,
  'OBJR': 3,
  'OCMD': 7,
  'ObjStm': 204,
  'Metadata': 161,
  'FileSpec': 159,
  'ExtGState': 598,
  'Halftone': 12,
  'Catalog': 95,
  'ViewerPreferences': 1,
  'Outlines': 18,
  'Filespec': 10,
  'Mask': 8,
  'Annot': 5300,
  'StructTreeRoot': 13,
  'FontDescriptor': 955,
  'Action': 32,
  'Page': 2962,
  'XRef': 24,
  'Encoding': 219,
  'EmbeddedFile': 4,
  'Pages': 427,
  'Font': 1473,
  'JobTicketContents': 1}
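
The sample output covers 82 files and carries [avg:...] figures that the single-file walkthrough above does not compute. They look like the totals divided by the number of parsed files (55928 / 82 ≈ 682.05 indirect objects per file). Here is a sketch of that aggregation, assuming each file yields a small dict of counts. summarize and the key names are hypothetical, not taken from the real script.

```python
def summarize(per_file_stats):
    """Sum each counter across files and pair it with its per-file average."""
    n_files = len(per_file_stats)
    if n_files == 0:
        return {}
    totals = {}
    for stats in per_file_stats:
        for key, value in stats.items():
            totals[key] = totals.get(key, 0) + value
    # float() forces true division on Python 2 as well as Python 3.
    return dict((key, (total, float(total) / n_files))
                for key, total in totals.items())


# Made-up counts for three files.
per_file = [
    {"indirect objects": 10, "streams": 4},
    {"indirect objects": 20, "streams": 2},
    {"indirect objects": 30, "streams": 3},
]
summary = summarize(per_file)
for key in sorted(summary):
    total, avg = summary[key]
    print("Total number of %s: %d [avg:%s]" % (key, total, avg))
```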
