Batch OCRing PDFs that haven't already been OCR'd

滥情空心 2021-01-14 16:04

If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and

4 Answers
  •  不知归路
    2021-01-14 16:29

    If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error, or returns empty or garbage text, that page has most likely not been OCRed.
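To make the "empty or garbage" check concrete, here is a minimal sketch in Python. It assumes you have already pulled the text of each page out with whatever extractor you use (PDFBox or anything else) and only decides which pages look un-OCRed. The function names and the thresholds `min_chars` and `min_ratio` are illustrative assumptions, not part of any library, and would need tuning on a sample of your own files.

```python
from typing import Iterable, List


def looks_like_text(extracted: str, min_chars: int = 20, min_ratio: float = 0.7) -> bool:
    """Heuristic (assumed thresholds): True if the string resembles readable text."""
    stripped = "".join(extracted.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False  # empty or near-empty: probably no text layer on this page
    # Real text is mostly letters and digits; garbage is mostly symbols or control chars.
    readable = sum(ch.isalnum() for ch in stripped)
    return readable / len(stripped) >= min_ratio


def pages_needing_ocr(page_texts: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of pages whose extracted text fails the heuristic."""
    return [n for n, text in enumerate(page_texts, start=1) if not looks_like_text(text)]


if __name__ == "__main__":
    sample = [
        "This page already has a proper text layer from OCR.",
        "",  # image-only page: the extractor returned nothing
    ]
    print(pages_needing_ocr(sample))
```

Checking every page rather than only the second one also covers the files where a single page was OCRed. Genuinely blank pages will be flagged too, so a PDF whose only flagged pages are blank may not need another OCR pass.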
