If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and
If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
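The same idea can be sketched in a shell script. Note that this swaps PDFBox for pdftotext from poppler-utils, which is my assumption and not what the answer above uses, and page_has_text is a hypothetical helper name. It only checks whether a page yields any non-whitespace text; it cannot tell real text from garbage.

```shell
#!/bin/sh
# Sketch: does a given page of a PDF carry extractable text?
# Assumes pdftotext (poppler-utils) is installed.
# page_has_text is a hypothetical helper name.

page_has_text() {
    # $1 = PDF file, $2 = page number (defaults to page 2)
    pht_page=${2:-2}
    # -f/-l select the first and last page; "-" writes the text to stdout.
    # A failed extraction (missing page, unreadable file) counts as "no text".
    pht_text=$(pdftotext -f "$pht_page" -l "$pht_page" "$1" - 2>/dev/null) || return 1
    # Succeed only if something other than whitespace came out.
    [ -n "$(printf '%s' "$pht_text" | tr -d '[:space:]')" ]
}

# Example use:
#   page_has_text scan.pdf 2 && echo "page 2 has a text layer"
```

Checking a page other than the first may help with files where only page 1 was OCRed.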
Unburying this thread.
You can know which PDF files have already been OCRed by testing them with pdffonts. If there are embedded fonts, it's very probable that the PDF is already OCRed.
As for the batch processing, I wrote a little script that can batch OCR to pdf/word/excel/csv output format.
You may find it at https://github.com/deajan/pmOCR. pmOCR (poor man's OCR) is a wrapper for the Abbyy OCR CLI for Linux or the open-source Tesseract 3.
This is exactly what I was looking for, I have thousands of scanned PDF files, where some were already OCR'ed and some are not.
So, I combined information I found on fora and Stack Overflow, and made my own solution that does EXACTLY that, which I have summarized for you here:
I am on Windows 10, and could not find the definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch-OCR tools on Windows, but none worked well. I spent countless hours manually checking which files already had a text-layer "under" the image.
UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem. There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.
In the Linux shell, go to the folder that holds your PDFs, for example:
cd /mnt/c/Users/name/OneDrive/Documents
Then run:
find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;
Running this might, of course, take a long time, depending on how many PDFs you have and how many of those are not OCR'ed yet.
Here is the sh-script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:
cd ~
This will bring you to your home folder.
pico pdf-ocr.sh
This will bring up an editor. Paste the script code below. Then press Ctrl+X, and press Y. Your file is now saved.
sudo chmod +x pdf-ocr.sh
This will give the script permission to be run.
The script:
#!/bin/sh
# List the font names used in the first 5 pages (tail drops the two header lines of pdffonts).
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "Not yet OCR'ed: $1 -------- Processing...."
    echo " "
    # -l sets the OCR languages; -s skips pages that already contain text.
    # The output overwrites the input file.
    ocrmypdf -l eng+deu+nld -s "$1" "$1"
    echo " "
else
    echo "Already OCR'ed: $1"
    echo " "
fi
Well, the find command looks up all PDF files in the current directory, including subdirectories. It then "sends" these files to the script, in which pdffonts checks if there are embedded fonts. If so, the file is skipped and the next one is tried. If no embedded fonts are found, ocrmypdf does the OCR-ing.
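Before letting this loose on thousands of files, a dry run may be worth it. The sketch below applies the same pdffonts test as the script but only prints the files that would be sent to OCR and changes nothing. The function names needs_ocr and list_unocred are made up for this example, and it assumes pdffonts is installed.

```shell
#!/bin/sh
# Dry run: list the PDFs that the pdffonts test would send to OCR.
# needs_ocr and list_unocred are hypothetical helper names.

needs_ocr() {
    # Font names from the first 5 pages; tail drops pdffonts' two header lines.
    # If pdffonts fails on a file, the list is empty and the file counts as
    # "needs OCR", which matches the behaviour of the original test.
    ocr_fonts=$(pdffonts -l 5 "$1" 2>/dev/null | tail -n +3 | cut -d' ' -f1 | sort -u)
    [ -z "$ocr_fonts" ] || [ "$ocr_fonts" = '[none]' ]
}

list_unocred() {
    # $1 = start directory (defaults to the current one).
    # Reads one path per line, so file names containing newlines are not supported.
    find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r ocr_file; do
        if needs_ocr "$ocr_file"; then
            printf '%s\n' "$ocr_file"
        fi
    done
}

# Example use:
#   list_unocred /mnt/c/Users/name/OneDrive/Documents > todo.txt
```

The resulting list gives a rough idea of how long the real run will take, and lets you spot files that are wrongly classified.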
I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. I can imagine, for example, that you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
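If you change the language list, it may help to check first that every language is actually installed for Tesseract, since ocrmypdf relies on it for OCR. A minimal sketch, assuming the tesseract command is on your PATH; langs_installed is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check that every language in a list like "eng+deu+nld"
# is known to the local Tesseract installation.
# langs_installed is a hypothetical helper name.

langs_installed() {
    # "tesseract --list-langs" prints a header line, then one language code per line.
    # Older versions print to stderr, hence 2>&1.
    li_have=$(tesseract --list-langs 2>&1)
    for li_lang in $(printf '%s' "$1" | tr '+' ' '); do
        # -x matches whole lines only, -F treats the code as a fixed string.
        if ! printf '%s\n' "$li_have" | grep -qxF "$li_lang"; then
            echo "missing: $li_lang"
            return 1
        fi
    done
    return 0
}

# Example use:
#   langs_installed eng+deu+nld && echo "all languages available"
```

It stops at the first missing language, so a long batch does not start with a language pack that is not there.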
Note: I am making the assumption here that if a PDF file has no embedded fonts (so it's basically an image (scan) in a PDF file), it has not been OCR'ed. I know that this might not always be accurate, but for me that is enough to determine which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files.
I know that it is a bit more hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort, because I have now made a "one click" batch processor that works. I could not find a solution for that with Windows tools.
I hope someone finds this and finds this useful. If anyone has improvements, please post them here.
Thanks.
Jos Jonkeren
Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.