Converting PDF to multipage tiff (Group 4)

不思量自难忘° 2020-12-02 00:32

I'm trying to convert PDFs as represented by the org.apache.pdfbox.pdmodel.PDDocument class and the icafe library (https://github.com/dragon66/icafe/) to a multipage tiff w

5 answers
  • 2020-12-02 00:51

    Here's some code that I use with PDFBox to save images to a multipage TIFF. It requires the TIFFUtil class from PDFBox (it isn't public, so you have to make a copy).

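    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import javax.imageio.IIOImage;
    import javax.imageio.ImageIO;
    import javax.imageio.ImageTypeSpecifier;
    import javax.imageio.ImageWriteParam;
    import javax.imageio.ImageWriter;
    import javax.imageio.metadata.IIOMetadata;
    import javax.imageio.stream.ImageOutputStream;

    // TIFFUtil is your local copy of PDFBox's non-public TIFFUtil class.
    void saveAsMultipageTIFF(List<BufferedImage> bimTab, String filename, int dpi) throws IOException
    {
        ImageWriter imageWriter = ImageIO.getImageWritersByFormatName("tiff").next();
        // try-with-resources flushes and closes the stream even if writing a page fails
        try (ImageOutputStream ios = ImageIO.createImageOutputStream(new File(filename)))
        {
            imageWriter.setOutput(ios);
            imageWriter.prepareWriteSequence(null);
            for (BufferedImage image : bimTab)
            {
                ImageWriteParam param = imageWriter.getDefaultWriteParam();
                IIOMetadata metadata = imageWriter.getDefaultImageMetadata(new ImageTypeSpecifier(image), param);
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                TIFFUtil.setCompressionType(param, image);
                TIFFUtil.updateMetadata(metadata, image, dpi);
                imageWriter.writeToSequence(new IIOImage(image, null, metadata), param);
            }
            imageWriter.endWriteSequence();
        }
        finally
        {
            imageWriter.dispose();
        }
    }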
    

    I experimented with this myself some time ago, using this code: https://www.java.net/node/670205 (I used solution 2).

    However, if you build up an array with lots of images, memory consumption goes up sharply. It would probably be better to render one page, add it to the TIFF file, then render the next page and drop the reference to the previous one, so that the GC can reclaim the space if needed.
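
    The page-at-a-time idea could look roughly like the sketch below. It uses only the JDK (Java 9 or later ships a TIFF ImageIO plugin) instead of PDFBox's TIFFUtil, so the class name StreamingTiffSketch, the method names and the renderPage callback are all hypothetical. Each page is converted to a 1-bit image first, because Group 4 ("CCITT T.6" in the JDK plugin) only applies to bilevel data.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.IntFunction;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

class StreamingTiffSketch {

    /**
     * Writes pageCount pages to out as a Group 4 multipage TIFF.
     * renderPage is a hypothetical callback that produces one page image
     * per call, so earlier pages are never kept in a list or array.
     */
    static void writeGroup4(OutputStream out, int pageCount,
                            IntFunction<BufferedImage> renderPage) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        try (ImageOutputStream ios = new MemoryCacheImageOutputStream(out)) {
            writer.setOutput(ios);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType("CCITT T.6"); // Group 4 in the JDK TIFF plugin
            writer.prepareWriteSequence(null);
            for (int i = 0; i < pageCount; i++) {
                // Only the current page is referenced; the previous one can be collected
                BufferedImage page = toBilevel(renderPage.apply(i));
                writer.writeToSequence(new IIOImage(page, null, null), param);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
    }

    /** Redraws an image onto a 1-bit black-and-white canvas, as CCITT compression requires. */
    static BufferedImage toBilevel(BufferedImage src) {
        if (src.getType() == BufferedImage.TYPE_BYTE_BINARY) {
            return src;
        }
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, null); // Java 2D maps each pixel to a palette entry
        g.dispose();
        return dst;
    }
}
```

    With PDFBox 2.x the callback would presumably wrap PDFRenderer.renderImageWithDPI and rethrow its IOException as an unchecked exception; that part is left out here.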

  • 2020-12-02 00:53

    Inspired by Yusaku's answer, I made my own version. It converts a multi-page PDF to a TIFF byte array.

    I used pdfbox 2.0.16 in combination with imageio-tiff 3.4.2.

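    import java.awt.image.BufferedImage;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import javax.imageio.IIOImage;
    import javax.imageio.ImageIO;
    import javax.imageio.ImageTypeSpecifier;
    import javax.imageio.ImageWriteParam;
    import javax.imageio.ImageWriter;
    import javax.imageio.metadata.IIOMetadata;
    import javax.imageio.stream.ImageOutputStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.ImageType;
    import org.apache.pdfbox.rendering.PDFRenderer;

    // PDF to TIFF toolbox method: renders every page and returns a multipage TIFF as bytes.
    // LOG and @Nonnull come from the surrounding class.
    private byte[] bytesToTIFF(@Nonnull byte[] in) {

        int dpi = 300;
        ImageWriter writer = ImageIO.getImageWritersByFormatName("TIFF").next();
        ByteArrayOutputStream imageBaos = new ByteArrayOutputStream();

        // Both the document and the image stream are closed automatically
        try (PDDocument document = PDDocument.load(in);
             ImageOutputStream ios = ImageIO.createImageOutputStream(imageBaos)) {

            writer.setOutput(ios);
            writer.prepareWriteSequence(null);

            PDFRenderer pdfRenderer = new PDFRenderer(document);
            ImageWriteParam params = writer.getDefaultWriteParam();

            for (int page = 0; page < document.getNumberOfPages(); page++) {
                BufferedImage image = pdfRenderer.renderImageWithDPI(page, dpi, ImageType.RGB);
                IIOMetadata metadata = writer.getDefaultImageMetadata(new ImageTypeSpecifier(image), params);
                writer.writeToSequence(new IIOImage(image, null, metadata), params);
            }

            writer.endWriteSequence();

        } catch (IOException ex) {
            LOG.warn("can't convert PDF to TIFF", ex);
            // The original fell through without a return value here; rethrow instead
            throw new UncheckedIOException(ex);
        } finally {
            writer.dispose();
        }

        // ios is closed at this point, so everything has been flushed into imageBaos
        LOG.trace("size found: {}", imageBaos.size());
        return imageBaos.toByteArray();
    }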
    
  • 2020-12-02 00:58

    It's been a while since the question was asked. I finally found the time, and a wonderful ordered dither matrix, to give some details on how "icafe" can be used to get similar or better results than calling an external ghostscript executable. Some new features were recently added to "icafe", such as better quantization and ordered dither algorithms, which are used in the following example code.

    The sample PDF I am going to use is princeCatalogue. Most of the following code is from the OP, with some changes due to the package name change and additional ImageParam settings.

    import java.awt.image.BufferedImage;
    import java.io.FileOutputStream;
    import java.io.IOException;
    
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    
    import com.icafe4j.image.ImageColorType;
    import com.icafe4j.image.ImageParam;
    import com.icafe4j.image.options.TIFFOptions;
    import com.icafe4j.image.quant.DitherMethod;
    import com.icafe4j.image.quant.DitherMatrix;
    import com.icafe4j.image.tiff.TIFFTweaker;
    import com.icafe4j.image.tiff.TiffFieldEnum.Compression;
    import com.icafe4j.io.FileCacheRandomAccessOutputStream;
    import com.icafe4j.io.RandomAccessOutputStream;
    
    public class Pdf2TiffConverter {
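        public static void main(String[] args) {
            String pdf = "princecatalogue.pdf";
            PDDocument pddoc = null;
            try {
                pddoc = PDDocument.load(pdf);
                savePdfAsTiff(pddoc);
            } catch (IOException e) {
                // Report the failure instead of silently swallowing it
                e.printStackTrace();
            } finally {
                if (pddoc != null) {
                    try {
                        pddoc.close();
                    } catch (IOException ignored) {
                    }
                }
            }
        }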
    
        private static void savePdfAsTiff(PDDocument pdf) throws IOException {
            BufferedImage[] images = new BufferedImage[pdf.getNumberOfPages()];
            for (int i = 0; i < images.length; i++) {
                PDPage page = (PDPage) pdf.getDocumentCatalog().getAllPages()
                        .get(i);
                BufferedImage image;
                try {
    //              image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 288); //works
                    image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 300); // does not work
                    images[i] = image;
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
    
            FileOutputStream fos = new FileOutputStream("a.tiff");
            RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
                    fos);
            ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
            ImageParam[] param = new ImageParam[1];
            TIFFOptions tiffOptions = new TIFFOptions();
            tiffOptions.setTiffCompression(Compression.CCITTFAX4);
            builder.imageOptions(tiffOptions);
            builder.colorType(ImageColorType.BILEVEL).ditherMatrix(DitherMatrix.getBayer8x8Diag()).applyDither(true).ditherMethod(DitherMethod.BAYER);
            param[0] = builder.build();
            TIFFTweaker.writeMultipageTIFF(rout, param, images);
            rout.close();
            fos.close();
        }
    }
    

    For ghostscript, I used the command line directly, with the same parameters provided by the OP. Screenshots of the first page of the resulting TIFF images are shown below:

    The left-hand side shows the output of "ghostscript" and the right-hand side the output of "icafe". It can be seen that, at least in this case, the output from "icafe" is better than the output from "ghostscript".

    Using CCITTFAX4 compression, the file from "ghostscript" is 2.22M and the file from "icafe" is 2.08M. Neither is very good, given that dithering is used to create the black-and-white output. In fact, a different compression algorithm produces a much smaller file: with LZW, the same output from "icafe" is only 634K, and with DEFLATE compression the file size goes down to 582K.
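
    A rough way to check this kind of size difference yourself, assuming Java 9+ and its built-in TIFF plugin rather than icafe: the sketch below encodes the same 1-bit image with three compression types and reports the sizes. The class TiffCompressionSizes and the checkerboard test image are made up for illustration. The checkerboard is only a crude stand-in for dithered output, whose many short runs are what Group 4's run-length coding handles poorly.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

class TiffCompressionSizes {

    /** Encodes one 1-bit image as a single-page TIFF and returns its size in bytes. */
    static int encodedSize(BufferedImage bilevel, String compressionType) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        try (ImageOutputStream ios = new MemoryCacheImageOutputStream(baos)) {
            writer.setOutput(ios);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType(compressionType);
            writer.write(null, new IIOImage(bilevel, null, null), param);
        } finally {
            writer.dispose();
        }
        return baos.size(); // complete only because ios has been closed above
    }

    /** Synthetic stand-in for dithered output: a one-pixel checkerboard. */
    static BufferedImage checkerboard(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, ((x + y) % 2 == 0) ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return img;
    }

    /** Returns compression type name -> encoded size, in insertion order. */
    static Map<String, Integer> compare(BufferedImage bilevel) throws IOException {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        for (String type : new String[] {"CCITT T.6", "LZW", "Deflate"}) {
            sizes.put(type, encodedSize(bilevel, type));
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        compare(checkerboard(512, 512)).forEach(
                (type, size) -> System.out.println(type + ": " + size + " bytes"));
    }
}
```

    Running it on one of your own rendered and dithered pages, instead of the checkerboard, gives a more realistic comparison for your documents.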

  • 2020-12-02 01:02

    Refer to my github code for an implementation with PDFBox.

  • 2020-12-02 01:16

    Since some of the dependencies used by other solutions to this problem look unmaintained, I came up with a solution using the latest version of pdfbox (2.0.16):

    final int DPI = 300; // assumed value; DPI was not defined in the original snippet
    ByteArrayOutputStream imageBaos = new ByteArrayOutputStream();
    ImageOutputStream output = ImageIO.createImageOutputStream(imageBaos);
    ImageWriter writer = ImageIO.getImageWritersByFormatName("TIFF").next();

    try (final PDDocument document = PDDocument.load(new File("/tmp/tmp.pdf"))) {

        PDFRenderer pdfRenderer = new PDFRenderer(document);

        int pageCount = document.getNumberOfPages();

        writer.setOutput(output);

        ImageWriteParam params = writer.getDefaultWriteParam();

        params.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);

        // Compression: None, PackBits, ZLib, Deflate, LZW, JPEG and CCITT
        // variants allowed
        params.setCompressionType("Deflate");

        writer.prepareWriteSequence(null);

        for (int page = 0; page < pageCount; page++) {
            // Render and write one page at a time; no reference to earlier pages is kept
            BufferedImage image = pdfRenderer.renderImageWithDPI(page, DPI, ImageType.RGB);
            IIOMetadata metadata = writer.getDefaultImageMetadata(new ImageTypeSpecifier(image), params);
            writer.writeToSequence(new IIOImage(image, null, metadata), params);
        }

        // Finished writing to output
        writer.endWriteSequence();

        // output has not been flushed yet, so this may be less than the final size
        System.out.println("imageBaos size: " + imageBaos.size());
    } catch (IOException e) {
        e.printStackTrace();
        throw new Exception(e);
    } finally {
        // avoid memory leaks
        writer.dispose();
    }
    

    You can then write imageBaos to a local file. But if, like me, you want to hand the image back to the calling method as a ByteArrayOutputStream, some further steps are needed.

    After processing is done, the image bytes are available in the ImageOutputStream output object. We need to move the offset to the beginning of the output object and then read the bytes and write them to a new ByteArrayOutputStream. A concise way looks like this:

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    long counter = 0; // number of bytes copied
    try {
        output.seek(0); // rewind to the beginning of the image stream
        while (true) {
            bos.write(output.readByte()); // throws EOFException at the end of the stream
            counter++;
        }
    } catch (EOFException e) {
        System.out.println("End of Image Stream after " + counter + " bytes");
    } catch (IOException e) {
        System.out.println("Error processing the Image Stream");
    }
    return bos;
    

    Alternatively, call flush() on the ImageOutputStream at the end so that imageBaos contains all the bytes, and return that.
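
    To see why the flush matters, here is a small sketch using only the JDK (Java 9+ for the TIFF plugin, not the imageio-tiff dependency). The class FlushBeforeReading and its method are hypothetical names. It records how many bytes have reached the ByteArrayOutputStream before and after the ImageOutputStream is closed; close() writes out whatever the stream was still caching.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

class FlushBeforeReading {

    /** Returns {bytes in the target before close, bytes in the target after close}. */
    static int[] sizesAroundClose(BufferedImage image) throws IOException {
        ByteArrayOutputStream imageBaos = new ByteArrayOutputStream();
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        int[] sizes = new int[2];
        try (ImageOutputStream output = new MemoryCacheImageOutputStream(imageBaos)) {
            writer.setOutput(output);
            writer.write(image);
            // The cache may still hold some or all of the encoded bytes here
            sizes[0] = imageBaos.size();
        } finally {
            writer.dispose();
        }
        // close() has pushed the cached bytes into imageBaos
        sizes[1] = imageBaos.size();
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage img = new BufferedImage(100, 100, BufferedImage.TYPE_BYTE_GRAY);
        int[] sizes = sizesAroundClose(img);
        System.out.println("before close: " + sizes[0] + ", after close: " + sizes[1]);
    }
}
```

    If the second number is larger than the first, reading imageBaos before the flush or close would have returned an incomplete TIFF.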
