I am trying to populate repeated forms with PDFbox. I am using a TreeMap and populating the forms with individual records. The format of the pdf form is such that there are si
Even though the above answer was marked as the solution to the problem, since the solution is buried in the comments, I wanted to add this answer at this level. I spent several hours searching for the solution.
My code snippets and comments.
// Collection solely for purpose of preventing premature garbage collection
List<PDDocument> sourceDocuments = new ArrayList<>( );
...
// Source document (actually inside a loop)
PDDocument docIn = PDDocument.load( artifactBytes );
// Add document to collection before using it to prevent the problem
sourceDocuments.add( docIn );
// Extract from source document
PDPage extractedPage = docIn.getPage( 0 );
// Add page to destination document
docOut.addPage( extractedPage );
...
// This was failing with "COSStream has been closed and cannot be read."
// Now it works.
docOut.save( bundleStream );
You appear to get the warning wrong. It says:
Warning: You did not close a PDF Document
So in contrast to what you think, "PDFbox saying PDDocument closed when its not", PDFBox says that you did not close a document!
After your edit one sees that it actually says that a COSStream has been closed and that a possible cause is that the enclosing PDDocument has already been closed. This is a mere possibility!
That being said, by adding pages from one document to another you probably end up with references to those pages from both documents. In that case, in the course of closing both documents (e.g. automatically via garbage collection), the second one to close may indeed stumble across some already closed COSStream instances.
So my first advice to simply close the documents at the end by
doc.close();
newDoc.close();
probably won't remove the warnings, merely change their timing.
Actually you don't merely create two documents doc and newDoc, you even create new PDDocument instances and assign them to doc again and again, in the process setting the former document objects in that variable free for garbage collection. So you eventually have a big bunch of documents to be closed as soon as they are no longer referenced.
I don't think it would be a good idea to close all those documents in doc early, in particular not before saving newDoc.
But if your code will eventually be run as part of a larger application instead of as a small, one-shot test application, you should collect all those PDDocument instances in some Collection, explicitly close them right after saving newDoc, and then clear the collection.
Actually your exception looks like one of those lost PDDocument instances has already been closed by garbage collection, so you should collect the documents even in case of a simple one-shot utility to keep them from being disposed of by the GC.
(@Tilman, please correct me if I'm wrong...)
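The "collect, save, close, clear" advice above can be sketched without any PDFBox-specific code. The class below is a hypothetical helper, not part of PDFBox (the names SourceDocumentKeeper and track are made up for illustration); it only assumes that the source documents implement java.io.Closeable, which PDDocument does.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of PDFBox): keeps strong references to the
 * source documents so they cannot be garbage collected while the combined
 * document still uses their pages, and closes them all in one go afterwards.
 */
final class SourceDocumentKeeper implements Closeable {
    private final List<Closeable> tracked = new ArrayList<>();

    /** Registers a document and returns it, so the call can be used inline. */
    <T extends Closeable> T track(T document) {
        tracked.add(document);
        return document;
    }

    /** Number of documents currently held open. */
    int size() {
        return tracked.size();
    }

    /** Closes every tracked document, clears the list, rethrows the first failure. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (Closeable document : tracked) {
            try {
                document.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        tracked.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

In the scenario of the question, one would wrap each freshly loaded source document in keeper.track( ... ) inside the loop and call keeper.close() only after newDoc has been saved, for example by opening the keeper in a try-with-resources block that encloses the save call.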
To prevent problems with different documents sharing pages, you can try and import the pages to the target document and thereafter add the imported page to the target document page tree. I.e. replace
newDoc.addPage(doc.getPage(0));
newDoc.addPage(doc.getPage(1));
by
newDoc.addPage(newDoc.importPage(doc.getPage(0)));
newDoc.addPage(newDoc.importPage(doc.getPage(1)));
This should allow you to close each PDDocument instance in doc before losing it.
There are certain drawbacks to this, though, cf. the method JavaDoc and this answer here.
In your combined document you will have many fields with the same name (at least in case of a sufficiently high number of entries in your CSV file) which you initially set to different values. And you access the fields from the PDAcroForm of the respective original document but don't add them to the PDAcroForm of the combined result document.
This is asking for trouble! The PDF format does consider forms to be document-wide with all fields referenced (directly or indirectly) from the AcroForm dictionary of the document, and it expects fields with the same name to effectively be different visualizations of the same field and therefore to all have the same value.
Thus, PDF processors might handle your document fields in unexpected ways, e.g.
In particular, programmatic reading of your PDF field values will fail because in that context the form is definitively considered document-wide and based in AcroForm. PDF viewers, on the other hand, might at first show the values you set and make things look ok.
To prevent this you should rename the fields before merging. You might consider using the PDFMergerUtility, which does such a renaming under the hood. For an example usage of that utility class have a look at the PDFMergerExample.
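Why the renaming matters can be shown with a toy model. This is only a sketch under stated assumptions: a plain Map stands in for the document-wide AcroForm, and the class FieldRenamingSketch, its methods, and the "_<index>" suffix scheme are all hypothetical. It does not reproduce how PDFMergerUtility actually names fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a plain map stands in for a document-wide AcroForm. */
final class FieldRenamingSketch {

    /** Builds a per-copy field name, e.g. "lastName" in copy 1 becomes "lastName_1". */
    static String uniqueName(String baseName, int copyIndex) {
        return baseName + "_" + copyIndex;
    }

    /**
     * Merges the field values of several filled-in form copies into one form.
     * Without renaming, copies that share a field name also share one value,
     * so later records overwrite earlier ones.
     */
    static Map<String, String> merge(List<Map<String, String>> records, boolean rename) {
        Map<String, String> form = new LinkedHashMap<>();
        for (int i = 0; i < records.size(); i++) {
            for (Map.Entry<String, String> field : records.get(i).entrySet()) {
                String name = rename ? uniqueName(field.getKey(), i) : field.getKey();
                form.put(name, field.getValue());
            }
        }
        return form;
    }
}
```

Merging two records that both contain a lastName field without renaming leaves a single lastName entry holding the last value written, while merging with renaming keeps lastName_0 and lastName_1 as separate entries.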