Batch OCRing PDFs that haven't already been OCR'd

滥情空心 2021-01-14 16:04

If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and

4 Answers

不知归路 (OP)

2021-01-14 16:29

If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error, comes back empty, or returns garbage, it's most likely not OCRed.
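Since some of the files have only one OCRed page, it may be safer to run the "empty or garbage" check on every page rather than just the second. Below is a rough sketch of that check in plain Python. It assumes you have already extracted the text of each page into a list of strings (with PDFBox, `PDFTextStripper` can be limited to one page via `setStartPage`/`setEndPage`). The function names and both thresholds are hypothetical and would need tuning on a sample of your own files.

```python
# Hypothetical thresholds; tune them on a sample of your own PDFs.
MIN_CHARS = 20            # fewer non-space characters than this counts as "no text"
MIN_WORDLIKE_RATIO = 0.6  # minimum share of characters that are letters or digits


def page_has_text(page_text):
    """Return True if the extracted text of one page looks like real text."""
    # Drop all whitespace so blank or near-blank pages count as empty.
    stripped = "".join(page_text.split())
    if len(stripped) < MIN_CHARS:
        return False
    # "Garbage" here means mostly symbols rather than letters or digits.
    wordlike = sum(ch.isalnum() for ch in stripped)
    return wordlike / len(stripped) >= MIN_WORDLIKE_RATIO


def pages_needing_ocr(page_texts):
    """Return the 1-based numbers of pages with missing or garbage text."""
    return [n for n, text in enumerate(page_texts, start=1)
            if not page_has_text(text)]


def classify(page_texts):
    """Label a document as 'needs_ocr', 'partial', or 'ocr_complete'."""
    missing = pages_needing_ocr(page_texts)
    if len(missing) == len(page_texts):
        return "needs_ocr"  # a document with no pages also lands here
    if missing:
        return "partial"
    return "ocr_complete"
```

Files labelled `needs_ocr` or `partial` are the ones to send to your OCR tool. For `partial` files, `pages_needing_ocr` tells you which pages still lack a text layer.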
