How do I find all image-based PDFs?

太阳男子 2021-02-15 11:59

I have many PDF documents in my system, and I notice sometimes that documents are image-based without editing capability. In this case, I do OCR for better search in Foxit Phan

2 Answers
  •  孤街浪徒
    2021-02-15 12:24

    Late to the party, but here's a simple solution. It assumes that PDF files which already contain fonts are not purely image-based:

    find ./ -name "*.pdf" -print0 | xargs -0 -I {} \
        bash -c 'file="$1";                         \
        if [ $(pdffonts "$file" 2> /dev/null |      \
        wc -l) -lt 3 ]; then echo "$file"; fi' _ {}
    
    • pdffonts lists the fonts used in a PDF file. If the file contains searchable text, it must also contain fonts, so pdffonts will list them. The check for fewer than three lines is there because the pdffonts header takes up 2 lines, so any result with fewer than 3 lines means the file has no fonts. AFAIK there shouldn't be false positives, although that is really a question for the pdffonts developers.

    As a one-liner:

    find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'file="$1"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi' _ {}
    

    Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains text. The command outputs the filenames of all PDF files that don't contain text.
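    The same header check could also be split into small shell functions, which may be easier to reuse in a script. This is only a sketch: has_font_rows and report_if_image_only are made-up helper names, and it assumes pdffonts keeps printing a 2-line header before the font rows.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers, not part of poppler.
# Assumption: pdffonts prints a 2-line header, then one row per font.

# Read pdffonts output on stdin; succeed if any row follows the 2-line header.
has_font_rows() {
    tail -n +3 | grep -q .
}

# Print the path if pdffonts reports no fonts for the file
# (this also fires when pdffonts fails or prints nothing).
report_if_image_only() {
    local file="$1"
    if ! pdffonts "$file" 2> /dev/null | has_font_rows; then
        printf '%s\n' "$file"
    fi
}

# Possible usage (left commented out so sourcing this file runs nothing):
# find . -name '*.pdf' -print0 |
#     while IFS= read -r -d '' f; do report_if_image_only "$f"; done
```

    Passing the filename as a function argument avoids pasting it into a bash -c string, so names with quotes or spaces should be handled safely.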

    My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.
