PDF text extraction returns wrong characters due to ToUnicode map

感动是毒 2020-12-16 07:09

I am trying to extract text from a foreign-language PDF file using PDFMiner, but am being foiled by a ToUnicode statement. The file behaves strangely even under normal PDF v…

1 Answer
  • 2020-12-16 07:53

    In short:

    Your PDF does not contain the information required for correct text extraction without the use of OCR.

    In detail:

    Both the ToUnicode Map and the Unicode entries in the font program of the embedded subset of Mangal-Regular in your PDF claim that these four glyphs

    [Image: four glyphs that all claim to be 0x915]

    all represent the same Unicode code point, 0x915.

    Thus, any text extraction program that does not look at the drawn glyph (i.e. does not attempt OCR) will return 0x915 for any of those glyphs.
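
    You can verify this yourself by dumping the font's ToUnicode CMap and looking at its bfchar/bfrange entries. Below is a minimal sketch, assuming pikepdf is installed (a separate library, not part of PDFMiner) and using "document.pdf" as a placeholder file name; in the printed CMap you should see several different character codes all mapped to <0915>:

        # Dump the ToUnicode CMap of every font on every page.
        # Assumptions: pikepdf >= 2.0 is installed; "document.pdf" is a placeholder name.
        import pikepdf

        with pikepdf.open("document.pdf") as pdf:
            for page_number, page in enumerate(pdf.pages, start=1):
                fonts = page.obj.get("/Resources", {}).get("/Font", {})
                for font_key, font in fonts.items():
                    to_unicode = font.get("/ToUnicode")
                    if to_unicode is None:
                        print(f"page {page_number}, font {font_key}: no ToUnicode CMap")
                        continue
                    # The CMap is a plain-text stream; its bfchar/bfrange sections
                    # show which character codes map to which Unicode code points.
                    print(f"page {page_number}, font {font_key}:")
                    print(to_unicode.read_bytes().decode("latin-1"))

    If several source codes map to the same <0915> destination, the extracted text cannot distinguish those glyphs, no matter which extraction library you use.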

    Background:

    You seem to wonder why PDF viewers display the text correctly while text extraction (copy&paste or PDFMiner) does not.

    The reason is that PDF as a format does not contain the text as such. It contains pointers (direct ones or via mappings) to glyph drawing instructions in embedded font programs. Using these pointers the PDF is drawn as you expect.

    Furthermore, it can contain extra information mapping such glyph pointers to Unicode code points. That extra information is what text extraction programs use. In the case of your PDF these mappings are incorrect and, therefore, the extracted text is incorrect.
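
    Since the embedded mapping itself is wrong, the practical workaround is the OCR route mentioned above: render each page to an image and recognize the Devanagari text from the pixels instead of trusting the ToUnicode map. A minimal sketch, assuming pdf2image (which requires poppler), pytesseract, and a Tesseract installation with the Hindi language data ("hin") — the file name and the language code are placeholders you would adapt to your document:

        # Render pages to images and OCR them instead of relying on the ToUnicode map.
        # Assumptions: poppler, Tesseract with "hin" traineddata, pdf2image, pytesseract.
        from pdf2image import convert_from_path
        import pytesseract

        pages = convert_from_path("document.pdf", dpi=300)  # placeholder file name
        for page_number, image in enumerate(pages, start=1):
            text = pytesseract.image_to_string(image, lang="hin")  # Devanagari model
            print(f"--- page {page_number} ---")
            print(text)

    OCR quality for Devanagari varies, but unlike any ToUnicode-based extraction it actually looks at the drawn glyphs, which is what the broken mapping forces you to do.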
