I have many PDF documents on my system, and I sometimes notice that some of them are image-based and cannot be edited. In that case, I run OCR for better search in Foxit Phan
Purely from the OCR side, we can use the Hough transform to find the biggest rectangle on a page, then calculate the ratio of its area to the area of the whole page. If the ratio is low, we can treat the page as skewed. Finally, the proportion of skewed pages among all pages can indicate whether the PDF is a scanned PDF.
I know the process is very slow and the proportion is difficult to determine. ^-^
I'm late to the party, but here's a simple solution. It assumes that PDF files that already contain fonts aren't purely image-based:
# The filename is passed as $1 rather than spliced into the script, so quotes in names are safe.
find ./ -name "*.pdf" -print0 | xargs -0 -I {} \
bash -c 'file="$1"; \
if [ $(pdffonts "$file" 2> /dev/null | \
wc -l) -lt 3 ]; then echo "$file"; fi' _ {}
As a one-liner:
find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ $(pdffonts "$1" 2> /dev/null | wc -l) -lt 3 ]; then echo "$1"; fi' _ {}
Explanation:
pdffonts file.pdf
prints a two-line header followed by one line per font, so it shows more than 2 lines if the PDF contains text.
The command above therefore outputs the filenames of all PDF files that don't contain text.
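If you want to reuse the check in a script, the same test can be wrapped in a small function. This is only a sketch: the names has_fonts and list_image_only are made up for illustration, and it relies on the same assumption as above, namely that pdffonts prints exactly two header lines before the font rows.

```shell
# has_fonts FILE
# Succeeds (exit status 0) when pdffonts lists at least one font for FILE.
# Assumption: pdffonts prints two header lines, then one line per font.
# A file that pdffonts cannot read yields 0 lines and counts as "no fonts".
has_fonts() {
    [ "$(pdffonts "$1" 2> /dev/null | wc -l)" -gt 2 ]
}

# list_image_only FILE...
# Prints each given file for which has_fonts fails, one per line.
# (Hypothetical helper name, not part of any existing tool.)
list_image_only() {
    for f in "$@"; do
        has_fonts "$f" || printf '%s\n' "$f"
    done
}
```

For example, list_image_only ./*.pdf prints the PDFs in the current directory that have no fonts. Like the find version, it can give false positives for PDFs that pdffonts fails to open.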
My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.