PDFTextStripper parsing with wrong encoding

爱一瞬间的悲伤 2021-01-26 18:08
// PDFBox 1.x: PDFText2HTML takes the output encoding as a String
PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();

result contains something like garbage characters instead of the expected text.
1 Answer
  • 2021-01-26 18:49

    How to parse this pdf?

    Short of OCR'ing it you don't.

    The PDF in question does not contain the information required to extract text without doing at least some OCR (at minimum, OCR'ing each glyph of the embedded font to build a glyph-to-character mapping), which would require additional libraries and code.

    As a requirement for text extraction, the PDF specification ISO 32000-1:2008 states in section 9.10.2 that the font used for the text to be extracted must

    • either contain a ToUnicode CMap (the font used in your document doesn't),
    • or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V), or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (the font used in your document isn't),
    • or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font. The font in your document neither uses one of those predefined encodings, nor are the character names in its Differences array drawn from those sets: the names used are /0, /1, ..., /155.
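    A quick way to see which of the cases above might apply is to check whether the PDF contains a /ToUnicode entry at all. The sketch below is a crude, hypothetical diagnostic using only the JDK (no PDFBox): it scans the raw file bytes for the literal /ToUnicode key. It will miss fonts stored inside compressed object streams (PDF 1.5+), so treat a negative result as a hint, not proof.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ToUnicodeProbe {
    // Heuristic check: does the raw PDF contain a /ToUnicode key?
    // PDF object syntax is Latin-1-safe, so a byte-preserving decode is fine.
    public static boolean containsToUnicode(byte[] pdfBytes) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/ToUnicode");
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(containsToUnicode(data)
                ? "At least one /ToUnicode CMap found"
                : "No /ToUnicode found; extraction will likely yield garbage");
    }
}
```

    If the probe finds nothing, the remaining checks (predefined CMaps, predefined encodings, Differences names) require actually parsing the font dictionaries, for which a library such as PDFBox is the practical choice.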

    Generally, a good first test is to try copying and pasting the text in Adobe Reader, since a great deal of text-extraction know-how has gone into the Reader's code. If you try that here, you'll see that you only get garbage.
