Apache PDFBOX - getting java.lang.OutOfMemoryError when using split(PDDocument document)

栀梦 2021-01-25 22:27

I am trying to split a fairly large document of 300 pages using the Apache PDFBox API, version 2.0.2. When I split the PDF file into single pages using the following code, I get a java.lang.OutOfMemoryError:



        
1 answer
  • 2021-01-25 23:30

    PDFBox keeps every part produced by the split operation on the heap as an object of type PDDocument, so the heap fills up quickly. Even if you call close() on each part after every iteration of the loop, the GC cannot reclaim heap space as fast as it is being filled.

    One option is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):

    // Imports needed by the two methods below (PDFBox 2.0.x)
    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import org.apache.pdfbox.multipdf.Splitter;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public void execute() {
        File inputFile = new File("path/to/the/file.pdf");
        // try-with-resources closes the source document even if a batch fails
        try (PDDocument document = PDDocument.load(inputFile)) {
            int batchSize = 50;
            int totalPages = document.getNumberOfPages();
            int batch = 1;
            // Splitter page numbers are 1-based and inclusive on both ends,
            // so consecutive batches must not share a boundary page
            for (int start = 1; start <= totalPages; start += batchSize) {
                // the last batch is shorter when totalPages is not a multiple of batchSize
                int end = Math.min(start + batchSize - 1, totalPages);
                System.out.println("Batch: " + batch++ + " start: " + start + " end: " + end);
                split(document, start, end);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    
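    private void split(PDDocument document, int start, int end) throws IOException {
        Splitter splitter = new Splitter();
        splitter.setStartPage(start); // first page of the batch (1-based, inclusive)
        splitter.setEndPage(end);     // last page of the batch (inclusive)
        // with the default split size of 1, each returned document holds a single page
        List<PDDocument> splittedDocuments = splitter.split(document);
        String outputPath = Config.INSTANCE.getProperty("outputPath");
        String title = document.getDocumentInformation().getTitle();
        if (title == null) {
            title = "document"; // assumed fallback name for PDFs without a title
        }

        for (int index = 0; index < splittedDocuments.size(); index++) {
            // start + index is the page number in the source document, so names never collide
            File pdfFile = new File(outputPath, title + "_" + (start + index) + ".pdf");
            // close each part right after saving so its memory can be reclaimed
            try (PDDocument splittedDocument = splittedDocuments.get(index)) {
                splittedDocument.save(pdfFile);
            }
        }
    }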
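
    The batch arithmetic can be checked on its own, without PDFBox. The following is only a minimal sketch: it assumes page numbers are 1-based and inclusive on both ends, and the class BatchRanges and the method pageRanges are hypothetical names used for illustration, not part of the PDFBox API.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper, not part of PDFBox: computes the page range of each batch.
    class BatchRanges {

        // Returns {start, end} pairs, 1-based and inclusive on both ends (assumed convention).
        static List<int[]> pageRanges(int totalPages, int batchSize) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            List<int[]> ranges = new ArrayList<>();
            for (int start = 1; start <= totalPages; start += batchSize) {
                // The last batch is cut short when totalPages is not a multiple of batchSize
                int end = Math.min(start + batchSize - 1, totalPages);
                ranges.add(new int[] {start, end});
            }
            return ranges;
        }

        public static void main(String[] args) {
            // 120 pages in batches of 50
            for (int[] range : pageRanges(120, 50)) {
                System.out.println("pages " + range[0] + "-" + range[1]);
            }
        }
    }
    ```

    For 120 pages and a batch size of 50 this yields the ranges 1-50, 51-100 and 101-120. No two ranges share a page, and a document with zero pages produces no batches, so no extra "remaining pages" step is needed.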
    