How is hidden text stored in OCR-enhanced PDF files

醉梦人生 2021-01-03 07:55

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata

I'm actually looking for some detai

1 answer
  • 2021-01-03 08:21

    Does anyone know how these programs are storing their hidden text information really?

    You have correctly found out that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:

    • ABBYY FineReader creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
    • Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which neither fills nor strokes the glyphs, so nothing is painted).
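
To make the two drawing orders concrete, here is a minimal Python sketch that builds a toy content stream for each approach and then reports which one it is looking at. The resource names `/Im0` and `/F1`, the page size, the text position and all function names are assumptions chosen for illustration; real files produced by these programs contain considerably more operators.

```python
import re

# Naive matcher for PDF literal strings. It handles escaped parentheses
# but not nested balanced ones, which is enough for this illustration.
_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")


def _escape(text):
    # Backslash and parentheses must be escaped inside a PDF literal string.
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def _image_ops(image):
    # Scale the 1x1 image square to a 612x792 page (assumed page size).
    return ["q", "612 0 0 792 0 0 cm", f"{image} Do", "Q"]


def _text_ops(text, render_mode):
    # /F1, the font size and the position are hypothetical placeholders.
    return ["BT", f"{render_mode} Tr", "/F1 12 Tf", "72 720 Td",
            f"({_escape(text)}) Tj", "ET"]


def text_under_image(text, image="/Im0"):
    """Visible text first (mode 0), then the scan painted on top of it."""
    return "\n".join(_text_ops(text, 0) + _image_ops(image))


def invisible_text_over_image(text, image="/Im0"):
    """Scan first, then the text in rendering mode 3 (invisible)."""
    return "\n".join(_image_ops(image) + _text_ops(text, 3))


def describe(stream):
    """Report drawing order and visibility; expects one Do and one Tj."""
    tokens = _LITERAL.sub("()", stream).split()
    image_at = tokens.index("Do")
    text_at = tokens.index("Tj")
    invisible = any(op == "Tr" and tokens[i - 1] == "3"
                    for i, op in enumerate(tokens) if i > 0)
    return {"image_before_text": image_at < text_at,
            "invisible_text": invisible}
```

Calling `describe()` on the uncompressed content stream of a page would, under these simplifying assumptions, tell the two approaches apart. A real stream would need a proper tokenizer and may use other text-showing operators such as `TJ`.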

    The difference between the latter two results is the choice of font used:

    • Acrobat uses regular standard 14 fonts, for which a PDF viewer already has a font program that renders them as normal glyphs.
    • Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as normal Latin glyphs but merely as empty space.
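
As a rough illustration of the font distinction, the following sketch checks whether a font's base name belongs to the standard 14 fonts defined by the PDF specification. The helper name `is_standard_14` is hypothetical, and the check looks at the name only; it does not inspect whether a font program is actually embedded in the file.

```python
# The standard 14 fonts that a PDF viewer is expected to provide itself.
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique",
    "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})


def is_standard_14(base_font):
    """Return True if the /BaseFont name is one of the standard 14 fonts."""
    # Accept the name with or without the leading slash of a PDF name object.
    return base_font.lstrip("/") in STANDARD_14
```

For example, `is_standard_14("/Helvetica")` is true, whereas `is_standard_14("/GlyphLessFont")` is false, which matches the split described above.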

    Considering the visual effect you observed for the ABBYY FineReader result, the approach used by Acrobat or Tesseract might be preferable.

    Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste, since they are used only in the invisible rendering mode anyway.
