Extract Images from PDF with Apache Tika

情歌与酒 2021-01-06 16:01

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.

My use case is that I want some code t

2 Answers
  • 2021-01-06 16:42

    Try the code below. The returned ContentHandler holds the XML content, and each extracted image is written under path.

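    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public ContentHandler convertPdf(byte[] content, String path, String filename)
            throws IOException, SAXException, TikaException {

        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        ContentHandler handler = new ToXMLContentHandler();
        PDFParser parser = new PDFParser();

        // Extract inline images, but each distinct image only once.
        PDFParserConfig config = new PDFParserConfig();
        config.setExtractInlineImages(true);
        config.setExtractUniqueInlineImagesOnly(true);
        parser.setPDFParserConfig(config);

        // Called for every embedded resource; writes each one under `path`.
        EmbeddedDocumentExtractor embeddedDocumentExtractor =
                new EmbeddedDocumentExtractor() {
            private int unnamed = 0; // counter for resources without a name

            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }
            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                    throws SAXException, IOException {
                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name == null) {
                    name = "embedded-" + (unnamed++);
                }
                // Without REPLACE_EXISTING, Files.copy fails if the file already exists.
                Files.copy(stream, Paths.get(path, name), StandardCopyOption.REPLACE_EXISTING);
            }
        };

        context.set(PDFParser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

        try (InputStream stream = new ByteArrayInputStream(content)) {
            parser.parse(stream, handler, metadata, context);
        }

        return handler;
    }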
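
    When two embedded resources share a name, or a name is missing, a plain Files.copy either fails or overwrites an earlier image, depending on the copy options. Below is a minimal sketch, using only the JDK, of a helper that keeps every copy instead. EmbeddedImageSaver, save(), and the "embedded-N.bin" and "N-name" naming schemes are hypothetical and not part of Tika; the idea is that parseEmbedded() could delegate to it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Tika class): writes each stream to its own file.
final class EmbeddedImageSaver {
    private final Path outputDir;
    private int unnamedCount = 0; // numbering for resources without a name

    EmbeddedImageSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Copies the stream into outputDir and returns the path that was written. */
    Path save(InputStream stream, String resourceName) throws IOException {
        Files.createDirectories(outputDir);

        // Assumed fallback naming scheme for missing resource names.
        String name = (resourceName == null || resourceName.isEmpty())
                ? "embedded-" + (unnamedCount++) + ".bin"
                : resourceName;

        // On a name clash, prefix a counter rather than overwrite.
        Path target = outputDir.resolve(name);
        for (int i = 1; Files.exists(target); i++) {
            target = outputDir.resolve(i + "-" + name);
        }

        Files.copy(stream, target);
        return target;
    }
}
```

    Inside parseEmbedded(), the call would look like saver.save(stream, metadata.get(Metadata.RESOURCE_NAME_KEY)), with saver created once per parsed document.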
    
  • 2021-01-06 16:55

    An AutoDetectParser can also extract images, without using PDFParser directly. The same code works just as well for extracting images from docx, pptx, and other formats.

    Here, parseDocument() and setPdfConfig() make use of an AutoDetectParser:

    1. Create an AutoDetectParser.
    2. Attach an EmbeddedDocumentExtractor to a ParseContext.
    3. Attach the AutoDetectParser to the same ParseContext.
    4. Attach a PDFParserConfig to the same ParseContext.
    5. Pass that ParseContext to AutoDetectParser.parse().

    The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.

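    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Enables inline image extraction for PDFs; each distinct image is extracted once.
    private static void setPdfConfig(ParseContext context) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(true);

        context.set(PDFParserConfig.class, pdfConfig);
    }

    // Parses the file at `path`, saves embedded images to "<path>_/", and returns the XHTML.
    private static String parseDocument(String path) {
        String xhtmlContents = "";

        AutoDetectParser parser = new AutoDetectParser();
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        EmbeddedDocumentExtractor embeddedDocumentExtractor =
                new EmbeddedDocumentExtractor() {
            private int unnamed = 0; // counter for resources without a name

            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }
            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                    throws SAXException, IOException {
                Path outputDir = Paths.get(path + "_");
                Files.createDirectories(outputDir);

                String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name == null) {
                    name = "embedded-" + (unnamed++);
                }
                // Overwrite any image left over from a previous run.
                Files.copy(stream, outputDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
            }
        };

        context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
        context.set(AutoDetectParser.class, parser);

        setPdfConfig(context);

        try (InputStream stream = new FileInputStream(path)) {
            parser.parse(stream, handler, metadata, context);
            xhtmlContents = handler.toString();
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }

        return xhtmlContents;
    }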
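
    Since this approach also handles docx, pptx, and similar files, the resource name may come from inside the document rather than from Tika, so it seems prudent to treat it as untrusted before using it as a file name. The sketch below, using only the JDK, derives the <sourceFile>_ folder and keeps only the last component of the resource name so that a name like ../x cannot escape the folder. ImageOutputPaths, outputDirFor(), and resolveInside() are hypothetical names, not Tika API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a Tika class) for building image output paths.
final class ImageOutputPaths {
    private ImageOutputPaths() {}

    /** Folder for a source file's images: same location, source name plus "_". */
    static Path outputDirFor(String sourcePath) {
        return Paths.get(sourcePath + "_");
    }

    /** Resolves a resource name inside outputDir, dropping any directory components. */
    static Path resolveInside(Path outputDir, String resourceName) {
        if (resourceName == null) {
            throw new IllegalArgumentException("Resource name is missing");
        }
        // Treat both separator styles alike, then keep only the last component.
        String name = resourceName.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            throw new IllegalArgumentException("Unusable resource name: " + resourceName);
        }
        return outputDir.resolve(name);
    }
}
```

    In parseEmbedded(), the output path would then be ImageOutputPaths.resolveInside(ImageOutputPaths.outputDirFor(path), name) in place of string concatenation.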
    