Unable to extract text from a scanned PDF using TesseractOCRConfig in Apache Tika

迷失自我 2021-01-15 10:04

My PDF contains scanned images, and I want to extract the text from them.

What I tried: I parsed the file with AutoDetectParser, but got no output.

I followed the solution provi

1 answer
  • 2021-01-15 10:17

    Steps to solve this:

    1. Install Tesseract on your system. For Windows, use 'tesseract-ocr-setup-3.05.00dev.exe' from https://sourceforge.net/projects/tesseract-ocr-alt/files/ and pass its install location to the config via setTesseractPath.

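      Java code:

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.Parser;
      import org.apache.tika.parser.ocr.TesseractOCRConfig;
      import org.apache.tika.parser.pdf.PDFParserConfig;
      import org.apache.tika.sax.BodyContentHandler;

      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); // raise the handler's write limit
      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setTesseractPath(tPath); // tPath: folder where Tesseract is installed (step 1)
      PDFParserConfig pdfConfig = new PDFParserConfig();
      pdfConfig.setExtractInlineImages(true); // hand embedded images to the OCR parser
      pdfConfig.setExtractUniqueInlineImagesOnly(false); // false: process every occurrence of an image, even repeated ones
      ParseContext parseContext = new ParseContext();
      parseContext.set(TesseractOCRConfig.class, config);
      parseContext.set(PDFParserConfig.class, pdfConfig);
      // Needed so that embedded images are parsed recursively.
      parseContext.set(Parser.class, parser);
      // stream: an InputStream over the PDF, assumed to be opened by the caller.
      parser.parse(stream, handler, new Metadata(), parseContext); // handler.toString() then holds the text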
      
    2. Maven dependencies:

      <dependencies>
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>1.13</version>
        </dependency>
        <dependency>
          <groupId>com.levigo.jbig2</groupId>
          <artifactId>levigo-jbig2-imageio</artifactId>
          <version>1.6.5</version>
        </dependency>
        <dependency>
          <groupId>com.github.jai-imageio</groupId>
          <artifactId>jai-imageio-core</artifactId>
          <version>1.3.1</version>
        </dependency>
      </dependencies>
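
    Regarding step 1: if the folder given to setTesseractPath does not actually contain the Tesseract executable, OCR may be skipped without any error, which looks like "no output". The sketch below is a hypothetical helper (the class TesseractPathCheck and its methods are not part of Tika) that checks the folder using only the Java standard library. The default folder in main is an assumed example, not a path from the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper (not part of Tika): checks a Tesseract install folder
// before it is handed to TesseractOCRConfig.setTesseractPath(...).
final class TesseractPathCheck {

    // Name of the Tesseract binary: "tesseract.exe" on Windows, "tesseract" elsewhere.
    static String executableName(boolean windows) {
        return windows ? "tesseract.exe" : "tesseract";
    }

    // True when the current JVM reports a Windows operating system.
    static boolean isWindows() {
        return System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT)
                .startsWith("windows");
    }

    // Full path of the binary that should exist inside the install folder.
    static Path executableIn(Path installDir, boolean windows) {
        return installDir.resolve(executableName(windows));
    }

    // True only if the binary exists in installDir and can be executed.
    static boolean isUsable(Path installDir) {
        Path exe = executableIn(installDir, isWindows());
        return Files.isRegularFile(exe) && Files.isExecutable(exe);
    }

    public static void main(String[] args) {
        // Assumed example location; pass your own install folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "C:/Program Files/Tesseract-OCR");
        System.out.println(executableIn(dir, isWindows()) + " usable: " + isUsable(dir));
    }
}
```

    If isUsable returns false for your folder, correct the folder first and only then pass it to config.setTesseractPath.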

    I think it may be helpful. Thanks.
