Batch OCRing PDFs that haven't already been OCR'd

滥情空心 2021-01-14 16:04

If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and

4 answers
  • 2021-01-14 16:29

    If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
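
    The same check can be tried from the shell. The sketch below swaps PDFBox for pdftotext from poppler-utils (a substitution, not the library named above) and treats a page as OCRed when it yields at least a handful of non-whitespace characters. The function name page_has_text and the default threshold of 20 characters are made up for illustration.

```shell
#!/bin/sh
# page_has_text FILE PAGE [MIN_CHARS]
# Succeeds (exit 0) if pdftotext extracts at least MIN_CHARS
# non-whitespace characters from the given page.
# Assumption: pdftotext (poppler-utils) is installed; it stands in
# for PDFBox here and is not the same library.
page_has_text() {
    file=$1
    page=$2
    min_chars=${3:-20}   # hypothetical threshold, tune it for your scans

    # -f/-l select the first and last page; "-" sends the text to stdout.
    # The first tr strips all whitespace so blank pages count as zero;
    # the second removes padding that some wc implementations add.
    count=$(pdftotext -f "$page" -l "$page" "$file" - 2>/dev/null \
        | tr -d '[:space:]' | wc -c | tr -d ' ')

    [ "$count" -ge "$min_chars" ]
}
```

    For example, page_has_text scan.pdf 2 || echo "page 2 looks un-OCRed" would flag a file whose second page has no usable text. A fixed character threshold cannot tell garbage from real text, so it is only a rough filter.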

  • 2021-01-14 16:38

    Unburying this thread.

    You can know which PDF files have already been OCRed by testing them with pdffonts. If there are embedded fonts, it's very probable that the PDF is already OCRed.
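
    A minimal sketch of that test, assuming pdffonts from poppler-utils is on the PATH. The function name pdf_has_fonts is hypothetical, and ignoring unnamed fonts (which pdffonts prints as [none]) is a choice made here, not something pdffonts requires.

```shell
#!/bin/sh
# pdf_has_fonts FILE
# Succeeds (exit 0) if pdffonts lists at least one named font for FILE.
pdf_has_fonts() {
    # pdffonts prints two header lines (column names and a dashed rule);
    # every later line describes one font. Rows whose name column is
    # "[none]" are skipped (an assumption of this sketch).
    fonts=$(pdffonts "$1" 2>/dev/null | awk 'NR > 2 && $1 != "[none]"')
    [ -n "$fonts" ]
}
```

    Usage could look like pdf_has_fonts report.pdf && echo "probably OCRed". A PDF can contain fonts on only some of its pages, so a positive result says nothing about the remaining pages.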

    As for the batch processing, I wrote a little script that can batch OCR to pdf/word/excel/csv output format.

    You may find it at https://github.com/deajan/pmOCR. pmOCR (poor man's OCR) is a wrapper for the Abbyy OCR CLI for Linux or for Tesseract 3, the open source solution.

  • 2021-01-14 16:40

    This is exactly what I was looking for. I have thousands of scanned PDF files, some of which were already OCR'ed and some of which were not.

    So, I combined information I found on fora and Stack Overflow, and made my own solution that does EXACTLY that, which I have summarized for you here:

    • scan through all subdirectories recursively for PDF files;
    • check if the PDF was already OCR'ed, and if not, run high-quality OCR on it in the language(s) you specify;
    • save the OCR'ed PDF in-place, as PDF/A, overwriting the old (not-OCR'ed) one.

    I am on Windows 10, and could not find the definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch-OCR tools on Windows, but none worked well. I spent countless hours manually checking which files already had a text-layer "under" the image.

    UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem. There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.

    So, here it is, step by step:

    1. Enable the Windows subsystem for Linux in the Windows Control Panel; there are many guides. Google it. It's a couple of minutes.
    2. Install Linux from the Windows Store. Open the Windows Store, search for Ubuntu, and install. Takes around 5 minutes.
    3. Now you have the "Ubuntu app". Run it. It shows you a Linux bash prompt, with access to your Windows files through /mnt/c. It's magic!
    4. You need some Linux "apps", namely pdffonts and ocrmypdf. On Ubuntu, pdffonts is part of the poppler-utils package, so install them with the commands sudo apt install poppler-utils and sudo apt install ocrmypdf. We will use these apps to check if there is an embedded font in a PDF, and if not, OCR the PDF. (see note below).
    5. Install the very small bash script (below) to your home directory ~.
    6. Go to (cd) the directory where all your PDF's are saved. For example: /mnt/c/Users/name/OneDrive/Documents.
    7. Run the command: find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;

    Done!

    Running this might, of course, take a long time, depending on how many PDF's you have, and how many of those are not OCR'ed yet.

    Here is the sh-script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:

    1. type cd ~. This will bring you to your home folder.
    2. type pico pdf-ocr.sh. This will bring up an editor. Paste the script code below. Then press Ctrl+X, press Y, and press Enter to confirm the file name. Your file is now saved.
    3. type chmod +x pdf-ocr.sh. This will give the script permission to be run. (sudo is not needed for a file in your own home folder.)
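
    #!/bin/bash
    # Usage: pdf-ocr.sh FILE.pdf
    # List the distinct font names found in the first 5 pages
    # (tail skips the two header lines that pdffonts prints).
    MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort -u)
    if [ -z "$MYFONTS" ] || [ "$MYFONTS" = '[none]' ]; then
        # No fonts, or only unnamed ones: treat the file as not yet OCR'ed.
        echo "Not yet OCR'ed: $1 -------- Processing...."
        echo " "
        # -l sets the OCR languages; -s skips pages that already contain text.
        ocrmypdf -l eng+deu+nld -s "$1" "$1"
        echo " "
    else
        echo "Already OCR'ed: $1"
        echo " "
    fi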

    What does this do?

    Well, the find command looks up all PDF files in the current directory including subdirectories. It then "sends" these files to the script, in which pdffonts checks if there are embedded fonts. If so, skip the file and try the next one. If no embedded fonts are found, use ocrmypdf to do the OCR-ing. I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. I can imagine for example that you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
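
    The script passes the same path as input and output to ocrmypdf. If you would rather not rely on how ocrmypdf handles that, one possible variation is to write to a temporary file and replace the original only when OCR succeeds. This is a sketch, not part of the original script: the function name ocr_replace, the .ocr-tmp suffix and the default language are all made up for illustration.

```shell
#!/bin/sh
# ocr_replace FILE [LANGS]
# OCRs FILE into a temporary file next to it and replaces the
# original only if ocrmypdf exits successfully.
ocr_replace() {
    src=$1
    langs=${2:-eng}        # hypothetical default language
    tmp="$src.ocr-tmp"     # hypothetical temp name in the same directory

    # -l sets the languages, -s skips pages that already contain text.
    if ocrmypdf -l "$langs" -s "$src" "$tmp"; then
        mv -- "$tmp" "$src"    # swap in the OCR'ed copy
    else
        status=$?              # exit status of the failed ocrmypdf run
        rm -f -- "$tmp"        # discard partial output, keep the original
        echo "OCR failed for $src (exit $status)" >&2
        return 1
    fi
}
```

    Called as ocr_replace "$1" eng+deu+nld, it would take the place of the ocrmypdf line in the script. Keeping the temporary file in the same directory means the final mv is a rename on the same filesystem.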

    Note: I am assuming here that if a PDF file has no embedded fonts (so it is basically a scanned image inside a PDF file), it has not been OCR'ed. I know that this might not always be accurate, but for me it is enough to decide which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files. Also note that the script only looks at the first 5 pages (pdffonts -l 5), so a file with fonts on those pages is skipped even if later pages are plain scans.
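
    For files where only some pages were OCR'ed, the font check can be run page by page. The sketch below assumes pdfinfo and pdffonts from poppler-utils are installed; the function name pages_without_fonts is hypothetical. It applies the same "no fonts means no OCR" heuristic to every page.

```shell
#!/bin/sh
# pages_without_fonts FILE
# Prints the number of each page on which pdffonts finds no font.
pages_without_fonts() {
    file=$1

    # pdfinfo prints a line such as "Pages:          12".
    total=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ {print $2}')
    [ -n "$total" ] || return 1    # unreadable file or no page count

    page=1
    while [ "$page" -le "$total" ]; do
        # -f and -l restrict pdffonts to a single page;
        # tail drops its two header lines.
        fonts=$(pdffonts -f "$page" -l "$page" "$file" 2>/dev/null | tail -n +3)
        [ -n "$fonts" ] || echo "$page"
        page=$((page + 1))
    done
}
```

    Running pages_without_fonts scan.pdf prints one page number per line, and empty output means every page has at least one font. It starts pdffonts once per page, so it will be slow on large collections. Since the script already calls ocrmypdf with -s, which skips pages that contain text, this is mainly useful for deciding which files to process at all.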

    I know that it is a bit more hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort, because I now have a "one click" batch processor that works. I could not find a solution for that with Windows tools.

    I hope someone finds this and finds this useful. If anyone has improvements, please post them here.

    Thanks.

    Jos Jonkeren

  • 2021-01-14 16:56

    Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.
