How do I find all image-based PDFs?

I have many PDF documents in my system, and I notice sometimes that documents are image-based without editing capability. In this case, I do OCR for better search in Foxit Phan


2 Answers

失恋的感觉

2021-02-15 12:14

Purely from the OCR side, we can use the Hough transform to find the largest rectangle on a page, then calculate the ratio of its area to the area of the whole page. If the ratio is low, we can treat the page as skewed. Finally, the proportion of skewed pages out of the total page count indicates whether the PDF is a scanned PDF.

I know this process is very slow, and the threshold proportion is hard to choose. ^-^

孤街浪徒

2021-02-15 12:24
Late to the party, but here's a simple solution. It assumes that PDF files which already contain fonts aren't purely image-based:
```
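# List PDFs with no embedded fonts (likely image-only).
# pdffonts prints a 2-line header, so fewer than 3 lines means no fonts.
# The filename is passed as $1 so that quotes in names cannot break the script.
find . -name "*.pdf" -print0 | xargs -0 -I {} \
    bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then
                 echo "$1"
             fi' _ {}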
```
pdffonts lists all embedded fonts in a PDF file. If the file contains searchable text, it must also contain fonts, so pdffonts will list them. The check for fewer than three lines is there because the pdffonts header takes 2 lines, so any result shorter than 3 lines means the file has no embedded fonts. AFAIK, there shouldn't be false positives, although this is more a question for the pdffonts developers.
As a one-liner:
```
find . -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -lt 3 ]; then echo "$1"; fi' _ {}
```
Explanation: pdffonts file.pdf prints more than 2 lines if the PDF contains text. The command outputs the filenames of all PDF files that don't contain text.
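The same header-counting check can also be written as small shell functions, which may be easier to reuse in a script. This is only a sketch: the names count_fonts and is_image_only are made up for illustration and are not part of pdffonts, and it assumes that pdffonts always prints exactly two header lines before the font rows.

```shell
# Count font rows in pdffonts-style output read from stdin.
# Assumption: lines 1 and 2 are the column header and the dashed separator.
count_fonts() {
    awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeeds (exit 0) when pdffonts reports no fonts
# for the PDF given as $1, i.e. the file is probably image-only.
is_image_only() {
    [ "$(pdffonts "$1" 2> /dev/null | count_fonts)" -eq 0 ]
}
```

For example, `is_image_only scan.pdf && echo scan.pdf` would print the name only when no fonts are listed. Note that a file pdffonts cannot read at all also yields zero rows, so it would be reported too.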

My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.