How do I find all image-based PDFs?

太阳男子 2021-02-15 11:59

I have many PDF documents in my system, and I notice sometimes that documents are image-based without editing capability. In this case, I do OCR for better search in Foxit Phan

2 Answers
  • 2021-02-15 12:14

    Purely from an OCR point of view, we can use the Hough transform to find the biggest square on a page, then calculate the ratio of its area to the area of the whole page. If the ratio is low, we can treat the page as skewed. Finally, the proportion of skewed pages relative to the total page count can indicate whether the PDF is a scanned PDF.

    I know the process is very slow and the proportion is difficult to determine. ^-^

  • 2021-02-15 12:24

    Late to the party, but here's a simple solution. It assumes that PDF files that already contain fonts aren't purely image-based:

    find . -name "*.pdf" -print0 | xargs -0 -I {} \
        bash -c 'file="$1"
        # pdffonts prints a 2-line header, so fewer than 3 lines means no fonts
        if [ "$(pdffonts "$file" 2> /dev/null | wc -l)" -lt 3 ]; then
            echo "$file"
        fi' _ {}
    
    • pdffonts lists all fonts used in a PDF file. If the file contains searchable text, it must also contain fonts, so pdffonts will list them. The check for fewer than three lines is there because the pdffonts header takes 2 lines, so any result shorter than 3 lines means the file has no fonts. AFAIK, there shouldn't be false positives, although this is more a question for the pdffonts developers.
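    The line-count check described above can also be wrapped in a small function. This is only a sketch: the name `classify_pdf` is made up for illustration, and it assumes that pdffonts exits with a non-zero status when it cannot open a file (for example, a corrupt or encrypted PDF). Under that assumption, unreadable files are reported separately instead of being counted as image-based.

```shell
# classify_pdf FILE: hypothetical helper (the name is illustrative only).
# Returns 0 if pdffonts lists no fonts, 1 if fonts are listed,
# and 2 if pdffonts itself failed (assumed: non-zero exit on error).
classify_pdf() {
    local out lines
    # Capture the output first so the exit status of pdffonts is not lost in a pipe.
    out=$(pdffonts "$1" 2> /dev/null) || return 2
    # The output is two header lines followed by one line per font.
    lines=$(printf '%s\n' "$out" | wc -l)
    if [ "$lines" -lt 3 ]; then
        return 0
    else
        return 1
    fi
}
```

    Calling `classify_pdf "$file" && echo "$file"` inside a loop over the files then prints only those for which no fonts were listed.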

    As a one-liner:

    find . -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then echo "$1"; fi' _ {}
    

    Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains text. The command outputs the filenames of all PDF files that don't contain text.

    My OCR project, which has the same feature, is on GitHub at deajan/pmOCR.
