Why is character ID 160 not recognised as Unicode in PDFMiner?

Question

I am converting .pdf files into .xml files using PDFMiner.

For each character in the .pdf file, PDFMiner checks (among many other things) whether it can be mapped to Unicode. If it can, it returns the character; if it cannot, an exception is raised and caught, and the string "(cid:%d)" is returned instead, where %d is the character ID, which I think is the Unicode decimal code.

This is well explained in the edit part of this question: "What is this (cid:51) in the output of pdf2txt?" I reproduce the code here for convenience:

def render_char(self, matrix, font, fontsize, scaling, rise, cid):
    try:
        text = font.to_unichr(cid)
        assert isinstance(text, unicode), text
    except PDFUnicodeNotDefined:
        text = self.handle_undefined_char(font, cid)


def handle_undefined_char(self, font, cid):
    if self.debug:
        print >>sys.stderr, 'undefined: %r, %r' % (font, cid)
    return '(cid:%d)' % cid

I usually get this exception for .pdf files written in Cyrillic. However, there is one file in plain English where I get this exception for non-breaking spaces (which have cid=160). I do not understand why this character is not recognised as Unicode, while all the others in the same file are.

In the same environment, if I run isinstance(u'160', unicode) in the console I get True, while an (apparently) equivalent check returns False when it is run inside PDFMiner.

If I debug, I see that the font is properly recognised, i.e. I get:

cid = 160
font =  <PDFType1Font: basefont='Helvetica'>

PDFMiner accepts the codec as a parameter. I have chosen utf-8, which lists 160 as the Unicode decimal code for the non-breaking space (http://dev.networkerror.org/utf8/).

In case it helps, here is the code for to_unichr:

def to_unichr(self, cid):
    if self.unicode_map:
        try:
            return self.unicode_map.get_unichr(cid)
        except KeyError:
            pass
    try:
        return self.cid2unicode[cid]
    except KeyError:
        raise PDFUnicodeNotDefined(None, cid)

Is there a way to set/change the character map recognised by the code?

What do you think I should change, or where do you think I should investigate, so that cid=160 does not raise the Exception?

Answer 1:

The font in question in the sample document is a Simple Font and uses WinAnsiEncoding. This encoding is defined in the PDF specification ISO 32000-1 as one of four special encodings in a table in Annex D.2, Latin Character Set and Encodings. This table does not contain an entry for 240 in the WIN column. (The table entries are given as octal numbers, so 240 is decimal 160.)
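To make the octal point concrete, here is a small check of the two codes involved (240 for WinAnsiEncoding, and 312, the MacRomanEncoding code that appears in the note quoted further down). It is only arithmetic and does not touch PDFMiner.

```python
# Annex D.2 lists character codes in octal; Python's 0o prefix converts them.
win_nbsp = 0o240   # WinAnsiEncoding code for the nonbreaking space
mac_nbsp = 0o312   # MacRomanEncoding code for the nonbreaking space

print(win_nbsp, mac_nbsp)   # 160 202
print(oct(160))             # 0o240
```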

This table is extracted as the ENCODING array in latin_enc.py. From this array, maps for those four encodings are generated in encodingdb.py, which are then used, e.g. for fonts with that very encoding (cf. PDFSimpleFont in pdffont.py).

Thus, PDFMiner does not recognise the code 160 as having any associated character in WinAnsiEncoding. This causes your problem.

Looking only at the table, that seems correct, but the notes below the table say:

The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE.

This seems to have been overlooked by the PDFMiner developers.

This oversight might be fixed by adding a second entry for space

('nbspace', None, 202, 160, None)

to the ENCODING array (which uses decimal numbers); if you prefer, you could use the name space instead of nbspace.

(I say might because I'm not into Python programming and, therefore, cannot check, in particular not for unwanted side effects.)
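As a rough illustration of why the extra row would matter, here is a toy model of building a code-to-character map from rows shaped like the proposed entry. It assumes the column order (name, std, mac, win, pdf) implied by that entry. ENCODING_ROWS, NAME_TO_UNICODE and build_win_map are hypothetical stand-ins, not PDFMiner's actual code, and whether PDFMiner's own glyph list knows the name nbspace is not verified here.

```python
# Rows follow the layout of the proposed entry: (name, std, mac, win, pdf).
# Only two illustrative rows are shown; the real ENCODING array is much longer.
ENCODING_ROWS = [
    ("space", 32, 32, 32, 32),
    ("A", 65, 65, 65, 65),
]
NBSPACE_ROW = ("nbspace", None, 202, 160, None)

# Hypothetical glyph-name lookup standing in for the name-to-Unicode step.
NAME_TO_UNICODE = {"space": " ", "A": "A", "nbspace": "\u00a0"}


def build_win_map(rows):
    """Map WinAnsiEncoding codes to characters, skipping rows without a WIN code."""
    return {
        win: NAME_TO_UNICODE[name]
        for name, _std, _mac, win, _pdf in rows
        if win is not None
    }


before = build_win_map(ENCODING_ROWS)
after = build_win_map(ENCODING_ROWS + [NBSPACE_ROW])

print(160 in before)      # False: a lookup of code 160 fails
print(repr(after[160]))   # '\xa0': the code now resolves to NO-BREAK SPACE
```

In this model, a lookup of 160 in the first map fails in the same way that to_unichr ends up raising PDFUnicodeNotDefined, while the second map resolves it.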

Answer 2:

One solution that works for me, for similar characters in a different file, is to use ftfy.fix_text(). I was drawn to this package for fixing mojibake baked into a PDF's Unicode text, basically the typical curly-quote mix-ups between different encodings. PDFMiner reported them as "(cid:146)", etc., but I wanted to clean them up further.

The class below works on that one file so far. It includes the minimum needed to make it print something; a working module would probably need more pdfminer elements. If you are using pdf2txt.py, you could perhaps put a copy somewhere safe, redirect the pdfminer.high_level.extract_text_to_fp(fp, **locals()) line to a safe copy of that module, tack this class onto the end of that copy, and swap it in for the base class it inherits from. I've only done the HTMLConverter, but the other converters could probably be handled similarly.

from io import BytesIO

import ftfy
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LTChar
from pdfminer.pdffont import PDFUnicodeNotDefined
from pdfminer.pdfinterp import PDFResourceManager


class HTMLConvertOre(HTMLConverter):
    """HTMLConverter that asks ftfy to rescue codes PDFMiner cannot map."""

    # The inherited HTMLConverter.__init__ is used unchanged, so it is not overridden.

    def render_char(self, matrix, font, fontsize, scaling, rise, cid, ncs,
                    graphicstate):
        """Mod invoking ftfy.fix_text() to possibly rescue bad cids."""
        try:
            text = font.to_unichr(cid)
            assert isinstance(text, str), str(type(text))
        except PDFUnicodeNotDefined:
            try:
                # Treat the code as a code point and let ftfy try to repair it.
                text = ftfy.fix_text(chr(cid), uncurl_quotes=False)
            except ValueError:
                # chr() rejects codes outside the Unicode range.
                text = self.handle_undefined_char(font, cid)
        # Width and displacement are looked up with the original cid.
        textwidth = font.char_width(cid)
        textdisp = font.char_disp(cid)
        item = LTChar(matrix, font, fontsize, scaling, rise, text, textwidth,
                      textdisp, ncs, graphicstate)
        self.cur_item.add(item)
        return item.adv


if __name__ == '__main__':
    rsrcmgr = PDFResourceManager()
    outfp = BytesIO()
    device = HTMLConvertOre(rsrcmgr, outfp)
    print(device)
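If patching the converter is not an option, a similar rescue can be sketched as a post-processing step on text that has already been extracted. The sketch below uses only the standard library rather than ftfy, and it assumes that the unmapped codes follow Windows-1252, which is roughly what WinAnsiEncoding corresponds to; that assumption will not hold for CID fonts or custom encodings. The name replace_cid_markers is made up for this example.

```python
import re

# Matches the "(cid:N)" markers produced by handle_undefined_char.
CID_MARKER = re.compile(r"\(cid:(\d+)\)")


def replace_cid_markers(text):
    """Replace "(cid:N)" with the Windows-1252 character for byte N, if any."""
    def decode(match):
        code = int(match.group(1))
        if code > 255:
            return match.group(0)      # not a single-byte code, leave as-is
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)      # byte is undefined in cp1252

    return CID_MARKER.sub(decode, text)


print(repr(replace_cid_markers("don(cid:146)t")))    # 'don’t'
print(repr(replace_cid_markers("10(cid:160)km")))    # '10\xa0km'
```

Markers that cannot be decoded under this assumption are left untouched, so they remain visible for manual inspection.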

Answer 3:

For those who got the above error, the code below might help. It uses minecart to show the first image on each page.

import io

import minecart
from PIL import Image

# Open the PDF in binary mode and keep it open while the pages are read.
with open('sample.pdf', 'rb') as pdffile:
    doc = minecart.Document(pdffile)

    for page in doc.iter_pages():
        # Take only the first image per page (assumes each page has one).
        im = page.images[0]
        byteArray = im.obj.get_data()
        image = Image.open(io.BytesIO(byteArray))
        image.show()

Hope it helps!!

Please refer to https://github.com/felipeochoa/minecart/issues/16 for details.

Source: https://stackoverflow.com/questions/34108647/why-character-id-160-is-not-recognised-as-unicode-in-pdfminer

Tags

python

pdf

utf-8

python-unicode

pdfminer