Unable to extract images from a PDF/A-1a document


I am using the following code to extract images from a PDF in PDF/A-1a format, but I am not able to get the images.

List list = document.         


        
1 answer
  •  花落未央
    2021-01-27 07:59

    Your problem is a combination of two issues:

    1) The "break;". Your file has two images. The first one is transparent or grey or whatever and JPEG encoded, but it isn't the one you want. The second one is the one you want, but the break ends the loop after the first image. So I changed one code segment of yours to this:

    while (imageIter.hasNext())
    {
        String key = (String) imageIter.next();
        PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
        System.out.println(totalImages);
        pdxObjectImage.write2file("C:\\SOMEPATH\\" + fileName + "_" + totalImages);
        totalImages++;

        // break;  // disabled: it stopped the loop after the first image
    }
    

    2) Your second image (the interesting one) is JBIG2 encoded. To decode it, you need to add the levigo JBIG2 ImageIO plugin to your class path. If you don't, you'll get this message in PDFBox 1.8.8, unless you have disabled logging:

    ERROR [main] org.apache.pdfbox.filter.JBIG2Filter:69 - Can't find an ImageIO plugin to decode the JBIG2 encoded datastream.
    

    (You didn't get that error message because the JBIG2 encoded image is the second one, and your loop stopped before reaching it.)
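
    A quick way to see whether a JBIG2 reader is visible on the class path is to ask ImageIO directly. This is only a sketch: the class name ImageIoPluginCheck is made up for illustration, and the format name "JBIG2" is an assumption about how the plugin registers itself.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper class, not part of PDFBox or the levigo plugin.
final class ImageIoPluginCheck {

    /** Returns true if ImageIO finds at least one reader for the given format name. */
    static boolean hasReaderFor(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Assumption: the plugin registers under the format name "JBIG2".
        // The result depends on whether a JBIG2 plugin jar is on the class path.
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
    }
}
```

    If this prints false, the plugin jar is probably missing from the class path that your extraction code runs with.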

    Three bonus hints:

    3) If you created this file yourself, e.g. on a photocopier, find out how to produce PDFs without JBIG2 compression. It is somewhat risky: lossy JBIG2 encoders have been reported to replace characters in scans with similar-looking ones.

    4) Don't use pdResources.getImages(); the getImages call is deprecated. Instead, use getXObjects(), and then check the type of each object when iterating.

    // pageImages is now the map returned by getXObjects()
    Iterator imageIter = pageImages.keySet().iterator();
    while (imageIter.hasNext())
    {
        String key = (String) imageIter.next();
        Object o = pageImages.get(key);
        // not every XObject is an image, so check before casting
        if (o instanceof PDXObjectImage)
        {
            PDXObjectImage pdxObjectImage = (PDXObjectImage) o;

            // do stuff
        }
    }
    

    5) Use a for-each loop instead of an explicit Iterator.
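
    Hints 4 and 5 can be combined into one small helper. The sketch below is deliberately independent of PDFBox so that it runs on its own: XObjectFilter and valuesOfType are hypothetical names, and plain strings and integers stand in for image and non-image XObjects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of PDFBox.
final class XObjectFilter {

    /** Collects the map values that are instances of the requested type, in iteration order. */
    static <T> List<T> valuesOfType(Map<String, ?> objects, Class<T> type) {
        List<T> matches = new ArrayList<T>();
        for (Object value : objects.values()) {   // for-each instead of an explicit Iterator
            if (type.isInstance(value)) {         // same check as "o instanceof ..."
                matches.add(type.cast(value));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in data: strings play the role of images, the integer plays another XObject type.
        Map<String, Object> xObjects = new LinkedHashMap<String, Object>();
        xObjects.put("Im0", "first image");
        xObjects.put("Fm0", Integer.valueOf(7));
        xObjects.put("Im1", "second image");
        System.out.println(valuesOfType(xObjects, String.class));
    }
}
```

    With PDFBox 1.8 on the class path, the call would presumably look like valuesOfType(pdResources.getXObjects(), PDXObjectImage.class), assuming getXObjects() returns a map keyed by name.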

    And if it wasn't already obvious: this has nothing to do with PDF/A :-)

    6) I forgot that you also asked how to tell whether an image is b/w. Here's the simple (not optimized) code that I mentioned in the comments:

    BufferedImage bim = pdxObjectImage.getRGBImage();
    
    boolean bwImage = true;
    
    int w = bim.getWidth();
    int h = bim.getHeight();
    for (int y = 0; y < h; y++)
    {
        for (int x = 0; x < w; x++)
        {
            Color c = new Color(bim.getRGB(x, y));
            int red = c.getRed();
            int green = c.getGreen();
            int blue = c.getBlue();
            if (red == 0 && green == 0 && blue == 0)
            {
                continue;
            }
            if (red == 255 && green == 255 && blue == 255)
            {
                continue;
            }
            bwImage = false;
            break;
        }
        if (!bwImage)
            break;
    }
    System.out.println(bwImage);
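
    The same check can also be wrapped in a method that takes any BufferedImage, which makes it easy to try without a PDF. This is a sketch only: BlackWhiteCheck and isBlackAndWhite are made-up names, and it compares the packed RGB value directly instead of creating a Color object per pixel.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper class, not part of PDFBox.
final class BlackWhiteCheck {

    /** Returns true if every pixel is pure black or pure white; the alpha channel is ignored. */
    static boolean isBlackAndWhite(BufferedImage image) {
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y) & 0xFFFFFF; // drop the alpha byte
                if (rgb != 0x000000 && rgb != 0xFFFFFF) {
                    return false; // first grey or coloured pixel settles it
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Synthetic 2x2 test image; TYPE_INT_RGB pixels start out black.
        BufferedImage image = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF);
        System.out.println(isBlackAndWhite(image));
        image.setRGB(1, 1, 0x808080); // add one grey pixel
        System.out.println(isBlackAndWhite(image));
    }
}
```

    In the extraction loop, the argument would be the image returned by pdxObjectImage.getRGBImage().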
    
