struct.error: unpack requires a string argument of length 16

Anonymous (unverified), submitted 2019-12-03 08:44:33

Question:

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:

    pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
        self.init_resources(resources)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = self.get_font(None, subspec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
        font = PDFCIDFont(self, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
        StringIO(self.fontfile.get_data()))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

A similar file (1.pdf), however, doesn't cause any problem.

I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?


Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
        interpreter.process_page(page)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
        self.init_resources(resources)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
        font = self.get_font(None, subspec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = PDFCIDFont(self, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
        BytesIO(self.fontfile.get_data()))
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

Answer 1:

TL;DR

Thanks to @mkl and @hynecker for the extra info... With that I can confirm this is a bug in pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it is picking up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag and so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.

Detailed diagnosis

As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?

As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.

This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo
    << /Registry (Adobe)
    /Ordering (UCS)
    /Supplement 0
    >> def
    /CMapName /Adobe-Identity-UCS def
    ...

So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.
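To make the "garbage in" part concrete, here is a minimal sketch of what the font header parsing would see if it were handed that text. It assumes the stream begins exactly with the bytes shown in the dump above, and it is written for Python 3 rather than the Python 2 used in the traceback; `SFNT_TAGS` is a name made up for this illustration.

```python
import struct
from io import BytesIO

# Assumed start of the stream, copied from the dump above.
stream_start = b"/CIDInit /ProcSet findresource begin"

fp = BytesIO(stream_start)
fonttype = fp.read(4)  # b"/CID", not a font tag
(ntables, _1, _2, _3) = struct.unpack(">HHHH", fp.read(8))

# Tags that commonly open sfnt-based (TrueType/OpenType) font data.
SFNT_TAGS = {b"\x00\x01\x00\x00", b"true", b"OTTO", b"ttcf"}

print(fonttype in SFNT_TAGS)  # False
print(ntables)                # 18798: the bytes b"In" read as a big-endian short
```

With a bogus table count that large, a loop reading 16 bytes per table would ask for roughly 300 KB of records, so it is plausible that it runs past the end of a much shorter stream and hits the short read from the traceback.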

Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.

Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynecker.

The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.

I believe the attached patch will fix your problem and should be safe to use in general.



Answer 2:

I fixed your problem in the source code, and I tried it on your file 2.pdf to make sure it worked.

In the file pdffont.py I replaced:

    class TrueTypeFont(object):

        class CMapNotFound(Exception):
            pass

        def __init__(self, name, fp):
            self.name = name
            self.fp = fp
            self.tables = {}
            self.fonttype = fp.read(4)
            (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
            for _ in xrange(ntables):
                (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
                self.tables[name] = (offset, length)
            return

with this:

    class TrueTypeFont(object):

        class CMapNotFound(Exception):
            pass

        def __init__(self, name, fp):
            self.name = name
            self.fp = fp
            self.tables = {}
            self.fonttype = fp.read(4)
            (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
            for _ in xrange(ntables):
                fp_bytes = fp.read(16)
                if len(fp_bytes) < 16:
                    break
                (name, tsum, offset, length) = struct.unpack('>4sLLL', fp_bytes)
                self.tables[name] = (offset, length)
            return

Explanations

@Nabeel Ahmed was right:

The format string >4sLLL requires a 16-byte buffer, and fp.read is correctly asked to read 16 bytes at a time.

So, the problem can only be with the buffer stream it's reading i.e. the content of your specific PDF file.

In the code we see that fp.read(16) is called in a loop without any check. Thus, we don't know for sure whether it successfully read all 16 bytes each time. It could, for instance, have reached EOF.

To avoid this problem, I just break out of the for loop when this kind of problem appears.

    for _ in xrange(ntables):
        fp_bytes = fp.read(16)
        if len(fp_bytes) < 16:
            break

In regular cases, where every read returns the full 16 bytes, it shouldn't change anything.
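As a standalone illustration of the same guard, here is a minimal sketch that can be run outside pdfminer. It is written for Python 3 (so `range` instead of `xrange`), `read_table_directory` is a hypothetical helper name rather than pdfminer API, and the input bytes are made up for the demo.

```python
import struct
from io import BytesIO


def read_table_directory(fp):
    """Read sfnt table records, stopping early on a short read."""
    tables = {}
    fonttype = fp.read(4)
    header = fp.read(8)
    if len(header) < 8:
        return fonttype, tables
    (ntables, _1, _2, _3) = struct.unpack(">HHHH", header)
    for _ in range(ntables):
        record = fp.read(16)
        if len(record) < 16:
            break  # truncated or non-font data: stop instead of raising
        (name, tsum, offset, length) = struct.unpack(">4sLLL", record)
        tables[name] = (offset, length)
    return fonttype, tables


# Hypothetical data: the header claims 3 tables, but only one full record follows.
data = (
    b"\x00\x01\x00\x00"
    + struct.pack(">HHHH", 3, 0, 0, 0)
    + struct.pack(">4sLLL", b"cmap", 0, 28, 100)
    + b"\x00" * 5  # truncated second record
)
fonttype, tables = read_table_directory(BytesIO(data))
print(tables)  # {b'cmap': (28, 100)}
```

Note that this only avoids the exception. If the bytes are not font data at all, whatever records were read before the break are still meaningless.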

I will try to open a pull request on GitHub, but I'm not sure it will be accepted, so for now I suggest you monkey patch your installation by editing your /home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py file directly.



Answer 3:

This really is an invalid PDF, because the keyword endobj is missing after three indirect objects (objects 5, 18 and 22).

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj. (chapter 7.3.10 in PDF reference)

The example 2.pdf is a simple PDF 1.3 file that uses a plain uncompressed cross-reference table and uncompressed object separators. The fault is easy to find with grep and a general file viewer: the PDF has 22 indirect objects. The pattern " obj" is found exactly 22 times (fortunately for simplicity, never accidentally inside a string object or a stream), but the keyword endobj is missing three times.

    $ grep --binary-files=text -B1 -A2 -E " obj|endobj" 2.pdf
    ...
    18 0 obj
    << /Length 451967/Length1 451967/Filter [/FlateDecode] >>
    stream
    ...
    endstream
                    % # see the missing "endobj" here
    17 0 obj
    << /Length 12743 /Filter [/FlateDecode] >>
    stream
    ...
    endstream
    endobj
    ...

Similarly the object 5 has no endobj before object 1 and the object 22 has no endobj before object 21.
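The same check can be sketched in Python for anyone without grep at hand. This is only a rough illustration: `count_obj_keywords` is a hypothetical helper, the regular expressions are my own approximation of the grep pattern, and, like grep, they can be fooled by accidental matches inside binary streams.

```python
import re


def count_obj_keywords(data):
    """Count 'N G obj' openers and 'endobj' closers in raw PDF bytes."""
    opens = len(re.findall(rb"\d+\s+\d+\s+obj\b", data))
    closes = len(re.findall(rb"\bendobj\b", data))
    return opens, closes


# Hypothetical miniature input: object 18 lacks its endobj.
sample = (
    b"18 0 obj\n<< /Length 4 >>\nstream\nabcd\nendstream\n"
    b"17 0 obj\n<< /Length 4 >>\nstream\nabcd\nendstream\nendobj\n"
)
print(count_obj_keywords(sample))  # (2, 1)
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and compare the two counts. A mismatch suggests missing endobj keywords.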

It is known that broken cross references in a PDF can, and usually should, be reconstructed from the obj/endobj keywords (see the PDF reference, chapter C.2). Some applications probably do the reverse and repair a missing endobj when the cross references are correct, but there is no written advice to that effect.



Answer 4:

The last error message tells you a lot:

    File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

You can easily debug what is going on, for example by adding debug statements directly to pdffont.py. My guess is that there is something special about the contents of your PDF. Judging by the name of the class that raises the error, TrueTypeFont, there is some incompatibility with the font type.
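One possible shape for such a debug statement is a small wrapper around the read that reports where the data ran out. This is just a sketch for Python 3: `read_exact` is a hypothetical helper, not something pdfminer provides, and the 20-byte stream is made up.

```python
import struct
from io import BytesIO


def read_exact(fp, size):
    """Read exactly `size` bytes or raise an error saying where it stopped."""
    start = fp.tell()
    data = fp.read(size)
    if len(data) != size:
        raise EOFError(
            "wanted %d bytes at offset %d, got %d" % (size, start, len(data))
        )
    return data


fp = BytesIO(b"\x00" * 20)  # hypothetical 20-byte stream
struct.unpack(">4sLLL", read_exact(fp, 16))  # fine: 16 bytes available

message = None
try:
    read_exact(fp, 16)  # only 4 bytes are left
except EOFError as exc:
    message = str(exc)
print(message)  # wanted 16 bytes at offset 16, got 4
```

An offset close to the start of the stream would hint that the data is not a font at all, while one near the end would point to a truncated font.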



Answer 5:

Let's start by explaining the statement that raises the exception:

struct.unpack('>4sLLL', fp.read(16)) 

where the synopsis is:

struct.unpack(fmt, buffer)

unpack unpacks from the buffer buffer (presumably packed earlier by pack(fmt, ...)) according to the format string fmt. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes must match the size required by the format, as reflected by calcsize().

The most common cause is a buffer with the wrong number of bytes for the format used. For example, for a format that expects 4 bytes, you supply only 3:

(a, b) = struct.unpack('BH', fp.read(3)) 

for this you'll get

struct.error: unpack requires a string argument of length 4 

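The reason: the format 'BH' expects 4 bytes, i.e. when we pack something using the 'BH' format it occupies 4 bytes of memory. In native mode on typical platforms, B takes 1 byte and H takes 2, and one pad byte is inserted between them so that H is aligned.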
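A short sketch of that size difference, for Python 3. The native-mode result depends on the platform's alignment rules, so the first value is only what one would typically see, and the exact wording of the error message differs between Python versions.

```python
import struct

print(struct.calcsize("BH"))   # native mode: typically 4 (1 + 1 pad byte + 2)
print(struct.calcsize(">BH"))  # standard mode, no padding: 3

short_buffer = b"\x00\x01"     # 2 bytes, but ">BH" needs 3
try:
    struct.unpack(">BH", short_buffer)
except struct.error as exc:
    # The message text varies across Python versions.
    print("struct.error:", exc)
```

A byte-order prefix such as `>` switches struct to standard sizes with no padding.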


To clarify further, let's look at the >4sLLL format string and verify the buffer size that unpack expects (the buffer being the bytes you're reading from the PDF file). Quoting from the docs:

The buffer’s size in bytes must match the size required by the format, as reflected by calcsize().

    >>> import struct
    >>> struct.calcsize('>4sLLL')
    16

To this point we can say there's nothing wrong with the statement:

(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16)) 

The format string >4sLLL requires a 16-byte buffer, and fp.read is correctly asked to read 16 bytes at a time.

So, the problem can only be with the buffer stream it's reading i.e. the content of your specific PDF file.


It may be a bug, as per this comment:

This is a bug in the upstream PDFminer by @euske There seems to be patches for this so it should be an easy fix. Beyond this I also need to strengthen the pdf parsing such that we never error out from a failed parse

I'll edit this if I find something helpful to add here, such as a solution or a patch.



Answer 6:

In case you still get struct errors after applying Peter's patch, especially when parsing many files in a single run of a script (using os.listdir), try turning off caching in the resource manager:

rsrcmgr = PDFResourceManager(caching=False) 

It helped me get rid of the remaining errors after applying the solutions above.
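For context, a batch run along those lines might look roughly like this. It is a sketch under assumptions: Python 3 with the pdfminer.six fork installed (the original question used Python 2 pdfminer), and `list_pdf_paths`, `extract_text` and `extract_all` are hypothetical helper names of my own.

```python
import os
import struct
from io import StringIO


def list_pdf_paths(directory):
    """Return sorted paths of the .pdf files directly inside `directory`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    )


def extract_text(path):
    """Extract text from one PDF using a non-caching resource manager."""
    # Imported lazily so list_pdf_paths works without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager(caching=False)
    out = StringIO()
    device = TextConverter(rsrcmgr, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return out.getvalue()


def extract_all(directory):
    """Map each PDF path to its text, or to None if struct.error was raised."""
    results = {}
    for path in list_pdf_paths(directory):
        try:
            results[path] = extract_text(path)
        except struct.error:
            results[path] = None  # note the failure and keep going
    return results
```

Creating the resource manager per file, with caching off, keeps font objects from one PDF from being reused for the next. Catching struct.error per file lets the rest of the batch continue.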


