Question
I'm trying to create an index for a PDF file that I scanned as images from an old original manuscript and then put through character recognition (OCR) in Adobe Acrobat Pro. The problem is that some of the words were unevenly spaced in the original, so the OCR text ended up with flaws. I used the Find and Fix Suspects tool, but there are still problems.
Case in point...
The text "FOR EXAMPLE" was unevenly spaced in the original document (and therefore in its scanned image), so Adobe reads it as three words, "FOR EX AMPLE". That produces an index entry for the word "ample", which would look perfectly valid if I did not know better. This is one of several similar problems I have found in the document so far (there are still more pages to proofread).
How can I fix the underlying OCR text so that it contains the correct information, both in the generated index and when searching the document?
PS: I cannot just switch to a pure OCR text version of the document since the manuscript is technical and has lots of drawings associated with the text. I need to keep the images and alter the "hidden" searchable text underneath.
Answer 1:
I found this answer suggesting ABBYY FineReader 14 (commercial; I am not affiliated). It looks like it will handle editing the recognized text, after which I presume your existing workflow would take care of the indexing. Here is another answer giving some more workflow details (albeit from three years ago).
Separately, this question has answers suggesting Perl's CAM::PDF and pdftk.
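Whichever tool ends up editing the hidden text layer, the split words still have to be located first, and the asker notes there are more pages left to proofread. The following is only a minimal sketch of one way to flag candidates like "EX AMPLE". It assumes the OCR text has already been exported to plain text by some other means, and the function name `find_split_word_candidates`, the sample sentence, and the tiny word list are all hypothetical. It does not read or modify the PDF itself.

```python
import re


def find_split_word_candidates(text, vocabulary):
    """Flag adjacent tokens that join into a known word.

    A pair is reported only when the joined form is in the vocabulary
    and at least one half is not, e.g. "EX" + "AMPLE" -> "EXAMPLE".
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    vocab = {word.lower() for word in vocabulary}
    candidates = []
    for left, right in zip(tokens, tokens[1:]):
        joined = (left + right).lower()
        has_fragment = left.lower() not in vocab or right.lower() not in vocab
        if joined in vocab and has_fragment:
            candidates.append((left, right, left + right))
    return candidates


# Hypothetical sample text and word list, for illustration only.
sample = "FOR EX AMPLE the valve is shown in the drawing"
words = {"for", "example", "ample", "the", "valve", "is", "shown", "in", "drawing"}

print(find_split_word_candidates(sample, words))
# [('EX', 'AMPLE', 'EXAMPLE')]
```

In practice the vocabulary would need to be a real word list plus the manuscript's technical terms, and every flagged pair would still need a human check before the text layer is corrected.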
Source: https://stackoverflow.com/questions/49328429/pdf-how-can-i-override-fix-searchable-text-in-a-scanned-image-ocr-file