How can I check if PDF page is image(scanned) by PDFBOX, XPDF

狂风中的少年 提交于 2019-12-12 04:13:20

问题


PDFBox problem on extract images. Hi, how I can check if pdf page is image and to extract that by PDFBOX library, there is a method to get images but if PDF Page is a Image it is not getting. could some one help me to solve this problem.

Xpdf problem on extract images. I try to extract images by another library xpdf it do strange flip on the page if it is a image. If pdf contain an small image as object image it give me ok, if page is scanned he us doing flip.

I want to extract the all Images from PDF, if PAGE is scanned to get them as image, if Page contain plain text and Images also to get Images from this page.

My point is to extract all Images from PDF. not only form a page but even if page is a image to extract them as image but do not skip them how is doing I think PDFbox.

XPDF is doing some thing but there is a problem FLIP(top,right) on page when he export scanned page

How can I solve this problem thanks.

Download File example for to test

    `PDDocument document = PDDocument.load(new File("/home/dru/IdeaProjects2/PDFExtractor/test/t1.pdf"));
    PDPageTree list = document.getPages();

    for (PDPage page : list) {
        PDResources pdResources = page.getResources();
        System.out.println(pdResources.getResourceCache());

        for (COSName c : pdResources.getXObjectNames()) {
            PDXObject o = pdResources.getXObject(c);

            if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
                File file = new File("/home/dru/IdeaProjects2/PDFExtractor/test/out/" + System.nanoTime() + ".png");
                ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject)o).getImage(), "png", file);
            }
        }
    }`

回答1:


Extract images properly

As the updated PDF makes clear the problem is that it does not have any images immediately on the page but it has form xobjects drawn onto it which do contain images. Thus, the image search has to recurse into the form xobjects.

And that is not all: All pages in the updated PDF share the same resources dictionary, they merely pick a different of its form xobjects to display. Thus, one really has to parse the respective page content stream to determine which xobject (with which images) is present on a given page.

Actually this is something the PDFBox tool ExtractImages does. Unfortunately, though, it does not show the page it found the image in question on, cf. the ExtractImages.java test method testExtractPageImagesTool10948New.

But we can simply borrow from the technique used by that tool:

PDDocument document = PDDocument.load(resource);
int page = 1;
for (final PDPage pdPage : document.getPages())
{
    final int currentPage = page;
    PDFGraphicsStreamEngine pdfGraphicsStreamEngine = new PDFGraphicsStreamEngine(pdPage)
    {
        int index = 0;

        @Override
        public void drawImage(PDImage pdImage) throws IOException
        {
            if (pdImage instanceof PDImageXObject)
            {
                PDImageXObject image = (PDImageXObject)pdImage;
                File file = new File(RESULT_FOLDER, String.format("10948-new-engine-%s-%s.%s", currentPage, index, image.getSuffix()));
                ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), new FileOutputStream(file));
                index++;
            }
        }

        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

        @Override
        public void clip(int windingRule) throws IOException { }

        @Override
        public void moveTo(float x, float y) throws IOException {  }

        @Override
        public void lineTo(float x, float y) throws IOException { }

        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {  }

        @Override
        public Point2D getCurrentPoint() throws IOException { return null; }

        @Override
        public void closePath() throws IOException { }

        @Override
        public void endPath() throws IOException { }

        @Override
        public void strokePath() throws IOException { }

        @Override
        public void fillPath(int windingRule) throws IOException { }

        @Override
        public void fillAndStrokePath(int windingRule) throws IOException { }

        @Override
        public void shadingFill(COSName shadingName) throws IOException { }
    };
    pdfGraphicsStreamEngine.processPage(pdPage);
    page++;
}

(ExtractImages.java test method testExtractPageImages10948New)

This code outputs images with file names "10948-new-engine-1-0.tiff", "10948-new-engine-2-0.tiff", "10948-new-engine-3-0.tiff", and "10948-new-engine-4-0.tiff", i.e. one per page.

PS: Please remember to include com.github.jai-imageio:jai-imageio-core in your classpath, it is required for TIFF output.

Flipped images

Another issue of the OP was that the images sometimes appear flipped upside-down, e.g. in case of his now newest sample file "t1_edited.pdf". The reason is that those images indeed are stored upside-down as image resources in the PDF.

When those images are drawn onto a page, the current transformation matrix in effect at that time mirrors the image drawn vertically and so creates the expected appearance.

By slightly enhancing the drawImage implementation in the code above, one can include indicators of such flips in the names of the exported images:

public void drawImage(PDImage pdImage) throws IOException
{
    if (pdImage instanceof PDImageXObject)
    {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        String flips = "";
        if (ctm.getScaleX() < 0)
            flips += "h";
        if (ctm.getScaleY() < 0)
            flips += "v";
        if (flips.length() > 0)
            flips = "-" + flips;
        PDImageXObject image = (PDImageXObject)pdImage;
        File file = new File(RESULT_FOLDER, String.format("t1_edited-engine-%s-%s%s.%s", currentPage, index, flips, image.getSuffix()));
        ImageIOUtil.writeImage(image.getImage(), image.getSuffix(), new FileOutputStream(file));
        index++;
    }
}

Now vertically or horizontally flipped images are marked accordingly.



来源:https://stackoverflow.com/questions/40531871/how-can-i-check-if-pdf-page-is-imagescanned-by-pdfbox-xpdf

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!