问题
I would like to write a script which runs a command to OCR
pdfs, which deletes the resulting images, after the text files has been written.
The two commands I want to combine are the following.
This command create folders, extract pgm
from each PDF
and adds them into each folder:
time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'
This commands does the OCR and deletes the resulting images (pgm
):
time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
I would like to combine both commands so that the script deletes the pgm
images after each OCR. If I run the above commands, the first will extract images and will eat up my disk space, then the second command would do the OCR and only after that delete the images as a last step.
So,
- Create folder
- Extract PGM from PDF
- OCR from PGM to txt
- Delete PGM images, which just have been used (missing)
Basically, I would like this 4 steps to be done in this order for each PDF
separated and not for all PDF
at once. How can I do this?
Edit:
My first attempt to solve my issues was to create the following command:
time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
However, tesseract would not find the language package.
回答1:
Updated Answer
I have not tested this please run it on a copy of a small subset of your files. You can turn off the messages with DEBUG:
at the start if you are happy it looks good:
#!/bin/bash
# Declare a function for "parallel" to call
doit() {
# Get name of PDF with and without extension
withext="$1"
noext="$2"
echo "DEBUG: Processing $withext into $noext"
# Make output directory
mkdir -p "$noext"
# Extract as PGM into subdirectory
gs ... -o "$noext"/"${noext}-%03d.pgm $withext"
# Go to target directory or die with error message
cd "$noext" || { echo ERROR: Failed to cd to $noext ; exit 1; }
# OCR and remove each PGM
n=0
for f in *pgm; do
echo "DEBUG: OCR $f into $n"
tesseract "$f" "$n" -l deu_frak
echo "DEBUG: Remove $f"
rm "$f"
((n=n+1))
done
}
# Ensure the function is exported to subshells
export -f doit
find . -name \*.pdf -print0 | parallel -0 doit {} {.}
You should be able to test the doit()
function without parallel
by running:
doit someFile.pdf someFile
Original Answer
If you want to do lots of things for each argument in GNU Parallel, the simplest way is to declare a bash
function and then call that.
It looks like this:
# Declare a function for "parallel" to call
doit() {
echo "$1" "$2"
# mkdir something
# extract PGM
# do OCR
# delete PGM
}
# Ensure the function is exported to subshells
export -f doit
find some files -print0 | parallel -0 doit {} {.}
来源:https://stackoverflow.com/questions/45031033/combine-two-commands-using-gnu-parallel-for-ocr-project