Extract Images from PDF with Apache Tika

情歌与酒 2021-01-06 16:01

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.

My use case is that I want some code t

2 Answers

囚心锁ツ (original poster)

2021-01-06 16:55

You can extract images with an AutoDetectParser, without calling PDFParser directly. The same code also extracts images from docx, pptx, and similar formats.

Here I have two functions, parseDocument() and setPdfConfig(), which make use of an AutoDetectParser. The steps are:

1. Create an AutoDetectParser.
2. Attach an EmbeddedDocumentExtractor to a ParseContext.
3. Attach the AutoDetectParser to the same ParseContext.
4. Attach a PDFParserConfig to the same ParseContext.
5. Pass that ParseContext to AutoDetectParser.parse().

The images are saved to a folder in the same location as the source file, named after the source file with an underscore appended (<source file>_/).

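import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Registers a PDF configuration that turns on inline image extraction.
private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    // Inline image extraction is off by default.
    pdfConfig.setExtractInlineImages(true);
    // Emit an image that is reused across pages only once.
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    context.set(PDFParserConfig.class, pdfConfig);
}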

private static String parseDocument(String path) {
    String xhtmlContents = "";

    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    EmbeddedDocumentExtractor embeddedDocumentExtractor = 
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
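            // Write images into "<source path>_", creating the folder if needed.
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);

            // The resource name may be missing; use a generated name
            // rather than writing a file literally called "null".
            String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
            if (name == null) {
                name = "embedded-" + java.util.UUID.randomUUID();
            }
            Path outputPath = outputDir.resolve(name);
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);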
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    context.set(AutoDetectParser.class, parser);

    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }

    return xhtmlContents;
}
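
One thing the code above takes on trust is the embedded resource name, which is joined straight onto the output folder. Here is a minimal sketch of how that step could be hardened, using only the JDK. The class EmbeddedOutputPaths and its methods outputDirFor() and resolveTarget() are hypothetical names made up for this illustration and are not part of Tika. The sketch assumes it is enough to keep the last segment of the name and to fall back to a generated "embedded-<index>" name when the name is missing or unusable.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Tika): builds output paths for embedded resources.
final class EmbeddedOutputPaths {
    private EmbeddedOutputPaths() {}

    // Output folder: the source path with "_" appended, as in the answer's code.
    static Path outputDirFor(String sourcePath) {
        return Paths.get(sourcePath + "_");
    }

    // Resolves a target file inside outputDir for one embedded resource.
    // index is a caller-supplied counter used only for the fallback name.
    static Path resolveTarget(Path outputDir, String resourceName, int index) {
        String fallback = "embedded-" + index;
        String name = (resourceName == null || resourceName.trim().isEmpty())
                ? fallback
                : resourceName.replace('\\', '/');
        // Keep only the last segment so a name such as "../x.png" stays inside outputDir.
        name = name.substring(name.lastIndexOf('/') + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = fallback;
        }
        return outputDir.resolve(name);
    }
}
```

Inside parseEmbedded(), the output path could then be computed with something like EmbeddedOutputPaths.resolveTarget(outputDir, metadata.get(Metadata.RESOURCE_NAME_KEY), counter), where counter is a field you increment per embedded resource. Whether you need this depends on how much you trust the documents you parse.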
