Working with streams in PDFrw for Python?

Question

I'm trying to read an example PDF with pdfrw. The PDF contains the phrase "Hello Matthew" near the bottom-left corner, at coordinates (100, 100). When I try to output the text (if that is even possible), I get a stream of data, and I can't figure out how to get it as text.

>>> import pdfrw

>>> file_object = pdfrw.PdfReader("Hello.pdf")
>>> file_object
{'/ID': ['<f643bc0910dfb67725d53e11054f4609>', '<f643bc0910dfb67725d53e11054f4609>'], '/Info': (5, 0), '/Root': {'/Outl
ines': (8, 0), '/PageMode': '/UseNone', '/Pages': {'/Count': '1', '/Kids': [{'/Contents': (7, 0), '/MediaBox': ['0', '0
', '595.2756', '841.8898'], '/Parent': {...}, '/Resources': {'/Font': (1, 0), '/ProcSet': ['/PDF', '/Text', '/ImageB',
'/ImageC', '/ImageI']}, '/Rotate': '0', '/Trans': {}, '/Type': '/Page'}], '/Type': '/Pages'}, '/Type': '/Catalog'}, '/S
ize': '9'}

>>> file_object.pages[0]
{'/Contents': (7, 0), '/MediaBox': ['0', '0', '595.2756', '841.8898'], '/Parent': {'/Count': '1', '/Kids': [{...}], '/T
ype': '/Pages'}, '/Resources': {'/Font': (1, 0), '/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI']}, '/Rot
ate': '0', '/Trans': {}, '/Type': '/Page'}

>>> file_object.pages[0].keys()
['/Contents', '/MediaBox', '/Parent', '/Resources', '/Rotate', '/Trans', '/Type']

>>> file_object.pages[0].Contents
{'/Filter': ['/ASCII85Decode', '/FlateDecode'], '/Length': '102'}

>>> file_object.pages[0].Contents.stream
'GapQh0E=F,0U\\H3T\\pNYT^QKk?tc>IP,;W#U1^23ihPEM_?CW4KISi90EC-p>QkRte=<%V"lI7]P)Rn29neZ[Kb,htEWn&q7Q2"V~>'

Answer 1:

That stream is encoded and compressed. You can tell from the /Filter entry in the stream's dictionary, which here lists /ASCII85Decode and /FlateDecode.

Unfortunately, pdfrw does not (yet?) know how to decompress a stream with that combination of filters. If you run your PDF through something like pdftk first to decompress it, you might see something more reasonable.

Disclaimer: I am the primary pdfrw author.

But...

Even then, especially for non-ASCII fonts, character-to-glyph mapping in PDFs is complicated, so you won't always see something that looks reasonable.

If you really want to examine the text in PDF files in depth, pdfminer might be more useful; pdfrw has not yet really grown the tools to do that well.

Answer 2:

If the only filter is /FlateDecode, or you can find an ASCII85Decode filter to run first (filters must be applied in the order listed), pdfrw can decode the stream. I have been using pdfrw.uncompress.uncompress([page.Contents]) to decode /FlateDecode streams. This is not the same as PdfReader.uncompress(): that method does not pass a single stream to the processing function, it passes all of the indirect objects.

>>> pdf = pdfrw.PdfReader('foo.pdf')
>>> pages = pdf.Root.Pages.Kids
>>> p1 = pages[0]
>>> p1.Contents
{'/Filter': '/FlateDecode', '/Length': '13679'}
>>> p1.Contents.stream[:30]
'x\x9cÕ}Ý\x92æ¶\x91å½"ô\x0eu5Qß¬ëk\x02üßP8BRwË'
>>> pdfrw.uncompress.uncompress([p1.Contents]) # Contents object/s in a list.
True # it returns True even if the stream is not decoded.
>>> p1.Contents.stream[:30]
'/Artifact <</Attached [/Top]/T' # ready for parsing
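For a stream like the one in the question, where /ASCII85Decode is listed before /FlateDecode, the two steps can also be done with the Python standard library alone. This is a minimal sketch, not part of pdfrw: decode_stream is a hypothetical helper, it assumes Python 3 and raw stream bytes as input, and it ignores any /DecodeParms (such as predictors).

```python
import base64
import zlib


def decode_stream(raw, filters):
    """Apply the named PDF filters to raw stream bytes, in the order given."""
    data = raw
    for name in filters:
        if name == "/ASCII85Decode":
            data = data.strip()
            if data.endswith(b"~>"):  # PDF end-of-data marker
                data = data[:-2]
            data = base64.a85decode(data)
        elif name == "/FlateDecode":
            data = zlib.decompress(data)
        else:
            raise ValueError("unsupported filter: %s" % name)
    return data


# Round trip on a made-up content stream, encoded the same way as Hello.pdf.
original = b"BT /F1 12 Tf 1 0 0 1 100 100 Tm (Hello Matthew) Tj ET"
encoded = base64.a85encode(zlib.compress(original)) + b"~>"
decoded = decode_stream(encoded, ["/ASCII85Decode", "/FlateDecode"])
print(decoded.decode("latin-1"))
```

With pdfrw on Python 3 the stream attribute is a str rather than bytes, so it would probably need converting first, for example with page.Contents.stream.encode("latin-1"); treat that encoding as an assumption to check against your pdfrw version. Note also that /Filter can be a single name rather than a list.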

Then search for lines ending in either 'TJ' or 'Tj' and take the values inside round brackets on those lines, and you have your text.
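The search for Tj and TJ can be sketched with regular expressions. The function names here are hypothetical, the sample content stream is made up, and the pattern makes simplifying assumptions: it handles backslash-escaped parentheses but not nested unescaped ones, and it ignores hex strings such as <48656C6C6F>. As Answer 1 warns, the extracted strings are only readable when the font uses a simple encoding.

```python
import re

# A PDF literal string: "(...)" with backslash escapes inside.
STRING = r"\((?:\\.|[^\\()])*\)"
# Either "(text) Tj" or "[ (te) -20 (xt) ] TJ".
SHOW_TEXT = re.compile(
    rf"({STRING})\s*Tj\b|\[((?:{STRING}|[^\[\]()])*)\]\s*TJ\b"
)


def unescape(literal):
    """Drop the outer parentheses and undo the \\( \\) \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_text(content):
    """Return one string per Tj/TJ operator found in a decoded content stream."""
    pieces = []
    for match in SHOW_TEXT.finditer(content):
        if match.group(1) is not None:  # (text) Tj
            pieces.append(unescape(match.group(1)))
        else:  # [ ... ] TJ: join the string fragments, skip the kerning numbers
            fragments = re.findall(STRING, match.group(2))
            pieces.append("".join(unescape(f) for f in fragments))
    return pieces


sample = (
    "BT /F1 12 Tf 1 0 0 1 100 100 Tm (Hello Matthew) Tj ET\n"
    "BT [(Hel) -20 (lo)] TJ ET"
)
print(extract_text(sample))
```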

If you need location information for the text, find the blocks of lines between BT and ET and check the line endings. A line ending in Tm is preceded by six values (a matrix, typically 1 0 0 1 x y); the last two numbers give the bottom-left corner of the text's starting position.
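A rough sketch of that lookup follows. text_positions is a hypothetical helper and the sample stream is made up. It assumes the position is set with Tm inside a BT ... ET block, so it will miss text positioned with other operators such as Td, and it can be fooled by the letters BT or ET appearing inside a text string.

```python
import re

NUMBER = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)"
# Everything between a BT and the next ET.
BT_ET = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# Six numbers followed by the Tm operator.
TM = re.compile(r"(?<![\w.])((?:" + NUMBER + r"\s+){6})Tm\b")


def text_positions(content):
    """Return the (x, y) pair from every Tm found inside BT ... ET blocks."""
    positions = []
    for block in BT_ET.finditer(content):
        for tm in TM.finditer(block.group(1)):
            a, b, c, d, x, y = (float(n) for n in tm.group(1).split())
            positions.append((x, y))
    return positions


sample = "BT /F1 12 Tf 1 0 0 1 100 100 Tm (Hello Matthew) Tj ET"
print(text_positions(sample))
```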

Source: https://stackoverflow.com/questions/43126440/working-with-streams-in-pdfrw-for-python

Tags

python

pdf

pdfrw