How do I find all image-based PDFs?


太阳男子 2021-02-15 11:59

I have many PDF documents on my system, and I sometimes notice that some of them are image-based and cannot be edited. In that case, I run OCR for better search in Foxit Phan

2 answers

孤街浪徒 (original poster)

2021-02-15 12:24
I'm late to the party, but here's a simple solution. It assumes that a PDF file which already contains fonts isn't purely image-based:
```
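# Print every PDF for which pdffonts lists no fonts (header only).
# The file name is passed as $1 so quotes or spaces in it cannot break the inner script.
find . -type f -name "*.pdf" -print0 | xargs -0 -I {} \
    bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then
        echo "$1"
    fi' _ {}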
```
- pdffonts lists the fonts used in a PDF file. If the file contains searchable text, it must also contain fonts, so pdffonts will list them. The check for fewer than three lines is there because the pdffonts header takes up 2 lines, so any result shorter than 3 lines means no fonts were found. AFAIK there shouldn't be false positives, although that is really a question for the pdffonts developers.
As a one-liner:
```
find . -type f -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then echo "$1"; fi' _ {}
```
Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains text, so the command outputs the names of all PDF files that don't contain text.
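The same line-counting idea can be split into small functions, which makes the header handling easier to see and to test. This is only a sketch: the names font_count and is_image_only are made up for illustration, and it assumes pdffonts (shipped with poppler-utils or Xpdf) is on the PATH. Like the commands above, it also reports files that pdffonts cannot read at all, since those produce no font lines either.

```shell
# font_count: read pdffonts output on stdin and print the number of font rows.
# (hypothetical helper name, not part of pdffonts)
font_count() {
    # tail -n +3 drops the two header lines; wc -l counts what is left.
    tail -n +3 | wc -l | tr -d ' '
}

# is_image_only FILE: succeed when pdffonts reports no fonts for FILE.
# (hypothetical helper name; assumes pdffonts is installed)
is_image_only() {
    [ "$(pdffonts "$1" 2> /dev/null | font_count)" -eq 0 ]
}

# Possible usage, printing each fontless PDF below the current directory:
#   find . -type f -name '*.pdf' -print0 |
#       while IFS= read -r -d '' f; do
#           is_image_only "$f" && printf '%s\n' "$f"
#       done
```

Counting rows after the header, instead of comparing the total against 3, keeps the "header is 2 lines" assumption in one place.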

My OCR project, which has the same feature, is on GitHub at deajan/pmOCR.