How do I find all image-based PDFs?

太阳男子 2021-02-15 11:59

I have many PDF documents on my system, and I sometimes notice that some of them are image-based and cannot be edited. In those cases, I run OCR for better search in Foxit Phan

2 Answers
  •  失恋的感觉
    2021-02-15 12:14

    Purely from the OCR side, we can use the Hough transform to find the largest rectangular region on a page, then calculate the ratio of its area to the area of the whole page. If the ratio is low, we can treat the page as sloping (skewed), which is typical of scans. Finally, the proportion of sloping pages among all pages indicates whether the PDF is a scanned PDF.

    I know the process is very slow and the threshold proportion is difficult to choose. ^-^
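
The decision step of this answer can be sketched in Python as follows. This is only a rough illustration, not a tested detector: it assumes the Hough-transform step has already been run elsewhere (for example with an image library) and has produced, for each page, the ratio of the largest detected rectangle's area to the page area. The function names `page_is_sloping` and `looks_scanned` and both default thresholds of 0.5 are hypothetical and would need tuning on real documents.

```python
from typing import Sequence


def page_is_sloping(rect_area: float, page_area: float,
                    ratio_threshold: float = 0.5) -> bool:
    """Return True when the largest rectangle covers too little of the page.

    ratio_threshold is a hypothetical cutoff, not a value from the answer.
    """
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    return (rect_area / page_area) < ratio_threshold


def looks_scanned(area_ratios: Sequence[float],
                  ratio_threshold: float = 0.5,
                  page_fraction_threshold: float = 0.5) -> bool:
    """Classify a PDF from its per-page area ratios.

    area_ratios holds one precomputed ratio per page (largest rectangle
    area divided by page area). The PDF is flagged as scanned when the
    share of sloping pages reaches page_fraction_threshold (also a
    hypothetical cutoff).
    """
    if not area_ratios:
        return False  # no pages, nothing to classify
    sloping_pages = sum(1 for ratio in area_ratios if ratio < ratio_threshold)
    return sloping_pages / len(area_ratios) >= page_fraction_threshold


if __name__ == "__main__":
    # Two of three pages fall below the ratio threshold.
    print(looks_scanned([0.3, 0.4, 0.9]))
    # Only one of three pages falls below the ratio threshold.
    print(looks_scanned([0.9, 0.95, 0.4]))
```

Keeping the two thresholds as separate parameters makes it easier to experiment with them, since, as the answer notes, the right proportion is hard to determine in advance.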
