问题
My pdf contains scanned images and I want to extract text from it.
What I tried : I tried with AutoDetectParsers but no output.
I followed the solution provided in Apache Tika extract scanned PDF files and also Apache Tika Jira at https://issues.apache.org/jira/browse/TIKA-1729 but getting empty string without any error.
My configuration : Win 7 64-bit OS, JDK 1.8.0_45.
Any kind of help is welcome.
回答1:
Steps to follow to solve this :
Install Tesseract in your system using 'tesseract-ocr-setup-3.05.00dev.exe' for Windows from: https://sourceforge.net/projects/tesseract-ocr-alt/files/ and set its location in your config.
Java code :
Parser parser = new AutoDetectParser(); BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); TesseractOCRConfig config = new TesseractOCRConfig(); config.setTesseractPath(tPath); PDFParserConfig pdfConfig = new PDFParserConfig(); pdfConfig.setExtractInlineImages(true); pdfConfig.setExtractUniqueInlineImagesOnly(false); // set to false if pdf contains multiple images. ParseContext parseContext = new ParseContext(); parseContext.set(TesseractOCRConfig.class, config); parseContext.set(PDFParserConfig.class, pdfConfig); //need to add this to make sure recursive parsing happens! parseContext.set(Parser.class, parser);
Maven dependencies :
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.13</version>
</dependency>
<dependency>
<groupId>com.levigo.jbig2</groupId>
<artifactId>levigo-jbig2-imageio</artifactId>
<version>1.6.5</version>
</dependency>
<dependency>
<groupId>com.github.jai-imageio</groupId>
<artifactId>jai-imageio-core</artifactId>
<version>1.3.1</version>
</dependency>
</dependencies>
I think it may be helpful. Thanks.
来源:https://stackoverflow.com/questions/39762841/unable-to-extract-scanned-pdf-using-tesseractocrconfig-apache-tika