Combine two commands using GNU parallel for OCR project

Question

I would like to write a script that OCRs PDFs and deletes the intermediate images once the text files have been written.

The two commands I want to combine are the following.

This command creates a folder for each PDF, extracts the pages as PGM images, and puts them into that folder:

time find . -name \*.pdf | parallel -j 4 --progress 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}'

This command does the OCR and deletes the PGM images it has processed:

time find . -name \*.pgm | parallel -j 4 --progress 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

I would like to combine both commands so that the script deletes the PGM images right after each OCR run. If I run the commands above as they are, the first one extracts the images for all PDFs and eats up my disk space; only then does the second one do the OCR and delete the images as a last step.

So, the steps are:

1. Create folder
2. Extract PGM from PDF
3. OCR from PGM to txt
4. Delete the PGM images that have just been used (missing)

Basically, I would like these 4 steps to be done in this order for each PDF separately, not for all PDFs at once. How can I do this?

Edit:

My first attempt at solving this was the following command:

time find . -name \*.pdf | parallel -j 4 -m --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/screen -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

However, tesseract would not find the language package.
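One way to narrow this down is to ask tesseract which languages it can see before starting a long run. The following is only a sketch: has_lang is a hypothetical helper name, and it assumes the installed tesseract supports the --list-langs option.

```shell
#!/bin/bash
# has_lang is a hypothetical helper name, not part of tesseract.
# It succeeds if the given language code appears as a whole line in the
# output of "tesseract --list-langs".
has_lang() {
    # Merge stdout and stderr in case the list is printed on stderr.
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}
```

For example, has_lang deu_frak || echo "deu_frak not found" >&2 would report the problem up front. If the language is missing, the TESSDATA_PREFIX environment variable is one place to look, since tesseract uses it to locate its traineddata files.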

Answer 1:

Updated Answer

I have not tested this, so please run it on a copy of a small subset of your files. Once you are happy that it looks good, you can remove the echo lines whose messages start with DEBUG:

#!/bin/bash

# Declare a function for "parallel" to call
doit() {
    # Get name of PDF with and without extension
    withext="$1"
    noext="$2"
    echo "DEBUG: Processing $withext into $noext"

    # Make output directory
    mkdir -p "$noext"

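    # Extract as PGM into subdirectory; the output pattern and the input
    # PDF are separate arguments, and the page files are named after the
    # last path component so that PDFs in nested directories also work
    gs ... -o "$noext/$(basename "$noext")-%03d.pgm" "$withext"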

    # Go to target directory or fail with an error message
    # (return rather than exit, so testing doit by hand does not close your shell)
    cd "$noext" || { echo "ERROR: Failed to cd to $noext" >&2; return 1; }

    # OCR and remove each PGM 
    n=0
    for f in *pgm; do
       echo "DEBUG: OCR $f into $n"
       tesseract "$f" "$n" -l deu_frak
       echo "DEBUG: Remove $f"
       rm "$f"
       ((n=n+1))
    done 
}

# Ensure the function is exported to subshells
export -f doit

find . -name \*.pdf -print0 | parallel -0 doit {} {.}

You should be able to test the doit() function without parallel by running:

doit someFile.pdf someFile
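If Ghostscript and Tesseract are not at hand, or you just want to check the control flow first, the same steps can be exercised against stand-ins. This is a sketch only: the gs and tesseract functions below are hypothetical fakes that shadow the real programs, doit_dry is a made-up name, and the fake gs always "renders" exactly two pages.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real tools, so the control flow can be
# checked without Ghostscript or Tesseract installed.
gs() {
    # Find the value that follows -o and treat it as a printf pattern.
    local pattern="" page
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-o" ]; then pattern="$2"; break; fi
        shift
    done
    # Pretend the PDF has two pages: create two empty PGM files.
    for page in 1 2; do
        : > "$(printf "$pattern" "$page")"
    done
}

tesseract() {
    # Pretend to OCR: write a text file named after the output base ($2).
    echo "fake text from $1" > "$2.txt"
}

# Same steps as doit(), but the body runs in a subshell (note the round
# brackets), so the cd does not change the caller's working directory.
doit_dry() (
    withext="$1"
    noext="$2"
    name="$(basename "$noext")"
    mkdir -p "$noext" || exit 1
    gs -o "$noext/$name-%03d.pgm" "$withext"
    cd "$noext" || exit 1
    for f in *.pgm; do
        [ -e "$f" ] || continue        # no pages were produced
        # Delete the image only if the (fake) OCR step succeeded.
        tesseract "$f" "${f%.pgm}" -l deu_frak && rm "$f"
    done
)
```

Running doit_dry ./someFile.pdf ./someFile in a scratch directory should leave someFile/someFile-001.txt and someFile/someFile-002.txt behind and no PGM files. Unlike doit() above, this variant names each text file after its page image rather than after a counter.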

Original Answer

If you want to do lots of things for each argument in GNU Parallel, the simplest way is to declare a bash function and then call that.

It looks like this:

# Declare a function for "parallel" to call
doit() {
    echo "$1" "$2"
    # mkdir something
    # extract PGM
    # do OCR
    # delete PGM
}

# Ensure the function is exported to subshells
export -f doit

find some files -print0 | parallel -0 doit {} {.}
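In that last line, GNU Parallel replaces {} with the input line and {.} with the input line minus its extension, so the function sees them as $1 and $2. To make that mapping concrete, here is a rough sequential stand-in written in plain bash. It is a sketch: strip_ext and each_with_stem are hypothetical names, and strip_ext only approximates {.} for ordinary paths such as ./dir/name.pdf.

```shell
#!/bin/bash
# Approximate what parallel substitutes for {.}: drop the last extension,
# but only if the file name itself (not a directory) contains a dot.
strip_ext() {
    case "${1##*/}" in
        *.*) printf '%s\n' "${1%.*}" ;;
        *)   printf '%s\n' "$1" ;;
    esac
}

# Sequential stand-in for "parallel doit {} {.}": calls the function named
# in $1 once for each remaining argument, one after the other.
each_with_stem() {
    local fn="$1" path
    shift
    for path in "$@"; do
        "$fn" "$path" "$(strip_ext "$path")"
    done
}
```

With the doit() skeleton above, each_with_stem doit ./a.pdf ./sub/b.pdf would call doit ./a.pdf ./a and then doit ./sub/b.pdf ./sub/b. It gives up the parallelism, but it is handy for checking what the function receives.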

Source: https://stackoverflow.com/questions/45031033/combine-two-commands-using-gnu-parallel-for-ocr-project

Tags: pdf, parallel-processing, ocr, tesseract, pgm