How is hidden text stored in OCR-enhanced PDF files


// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata

I'm actually looking for some detai

1 answer

夕颜

2021-01-03 08:21
Does anyone know how these programs are storing their hidden text information really?

You have correctly found that the approach of ABBYY FineReader differs from that of Adobe Acrobat and Tesseract:
- ABBYY creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
- Acrobat and Tesseract create content streams in which the image is drawn first and the text is then drawn invisibly (using text rendering mode 3, which neither fills nor strokes the glyphs).
The difference between the latter two results is the choice of font:
- Acrobat uses the regular standard 14 fonts, for which a PDF viewer has font programs that render them as normal glyphs.
- Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs of this font do not show as normal Latin glyphs but merely as empty space.
Considering the visual effect you observed in the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
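
To tell the two content stream layouts apart in your own files, something like the following rough sketch may help. It assumes you have already extracted and decompressed the page content stream into a string with some PDF library; the function name `classify_hidden_text` and the two sample streams are made up for illustration and are not taken from any of the tools discussed. It is only a heuristic: `Do` paints any XObject (not only images), inline images and nested parentheses in strings are not handled, and the rendering mode could be set in ways this does not catch.

```python
import re


def classify_hidden_text(content: str) -> str:
    """Heuristically classify how a decoded content stream hides OCR text.

    Assumption: `content` is an already decompressed page content stream.
    """
    # Blank out literal strings so text like "(3 Tr)" is not read as operators.
    stripped = re.sub(r"\((?:\\.|[^\\()])*\)", "()", content)
    tokens = stripped.split()

    first_text = tokens.index("BT") if "BT" in tokens else None
    first_image = tokens.index("Do") if "Do" in tokens else None
    if first_text is None or first_image is None:
        return "unknown"

    # "3 Tr" selects text rendering mode 3 (neither fill nor stroke).
    if any(a == "3" and b == "Tr" for a, b in zip(tokens, tokens[1:])):
        return "invisible-text"
    # Visible text painted before the image gets covered by it.
    if first_text < first_image:
        return "text-under-image"
    return "unknown"


# Hypothetical, simplified content streams for the two approaches.
text_first = "BT /F1 12 Tf 72 700 Td (Hello) Tj ET q 595 0 0 842 0 0 cm /Im0 Do Q"
image_first = "q 595 0 0 842 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET"

print(classify_hidden_text(text_first))   # text-under-image
print(classify_hidden_text(image_first))  # invisible-text
```

The first sample mirrors the text-then-image order described for ABBYY, the second the image-then-invisible-text order described for Acrobat and Tesseract.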

Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste, since they are only ever used in the invisible rendering mode.