My question is:
How can I extract text from a PDF file that is laid out in columns so that the result is separated by these columns?
Background: I wo
If I call setSortByPosition(true) on a PDFTextStripper, all characters of a page are put into lines without recognizing separate columns. But if I call setSortByPosition(false), the stripper does make this division.
[...] How is PDFTextStripper calculating column breaks?
It isn't.
By setting SortByPosition to false you tell PDFBox not to sort the text pieces from the page content stream but instead to accept them in the order in which they appear.
In your document the text pieces seem to be drawn in the reading order, i.e. column by column. This is not true for all documents, and to cope with other documents PDFBox offers the option of sorting the text pieces left-to-right, top-to-bottom.
Activating that option (setting SortByPosition to true) in your document returns the text without respect to the columns.
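To illustrate the difference, here is a small sketch that does not use PDFBox at all. The `Piece` class, the coordinates, and the simple "top-to-bottom, then left-to-right" comparator are assumptions made up for the illustration (with y growing downwards); PDFBox's actual sorting is more involved, but the effect on a two-column page is the same in spirit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionDemo {
    // Hypothetical stand-in for a positioned text piece; not a PDFBox type.
    public static final class Piece {
        final String text;
        final double x;
        final double y; // assumed to grow downwards from the top of the page

        public Piece(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the pieces in the order given, like SortByPosition = false.
    public static String inStreamOrder(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(p.text).append(' ');
        }
        return sb.toString().trim();
    }

    // Simplified model of SortByPosition = true: top-to-bottom, then left-to-right.
    public static String sortedByPosition(List<Piece> pieces) {
        List<Piece> copy = new ArrayList<>(pieces);
        copy.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return inStreamOrder(copy);
    }

    // A page whose content stream draws the left column first, then the right one.
    public static List<Piece> twoColumnPage() {
        return Arrays.asList(
                new Piece("A1", 50, 100),
                new Piece("A2", 50, 120),
                new Piece("B1", 300, 100),
                new Piece("B2", 300, 120));
    }

    public static void main(String[] args) {
        List<Piece> page = twoColumnPage();
        System.out.println(inStreamOrder(page));    // A1 A2 B1 B2
        System.out.println(sortedByPosition(page)); // A1 B1 A2 B2
    }
}
```

The column-wise result in the first case comes only from the drawing order in the content stream, not from any column detection.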
Are there methods in the pdfBox API to catch this / to extract the text by columns?
PDFBox does not analyze the page content to recognize columns. If you do that analysis yourself, though, it allows you to extract the text column by column, provided you supply the column rectangles as regions.
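In PDFBox the class for this is `PDFTextStripperByArea`: you register each rectangle with `addRegion`, call `extractRegions` for the page, and read each column with `getTextForRegion`. The sketch below deliberately does not call PDFBox; it only models the idea with a hypothetical `Piece` class and made-up coordinates (y growing downwards), so the rectangles and names are assumptions, not values from your document.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnRegionsDemo {
    // Hypothetical stand-in for a positioned text piece; not a PDFBox type.
    public static final class Piece {
        final String text;
        final double x;
        final double y;

        public Piece(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // For each named rectangle, collects the pieces located inside it and
    // joins them top-to-bottom, then left-to-right, within that region only.
    public static Map<String, String> textByRegion(List<Piece> pieces,
                                                   Map<String, Rectangle2D> regions) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            List<Piece> inside = new ArrayList<>();
            for (Piece p : pieces) {
                if (region.getValue().contains(p.x, p.y)) {
                    inside.add(p);
                }
            }
            inside.sort(Comparator.comparingDouble((Piece p) -> p.y)
                    .thenComparingDouble(p -> p.x));
            StringBuilder sb = new StringBuilder();
            for (Piece p : inside) {
                sb.append(p.text).append(' ');
            }
            result.put(region.getKey(), sb.toString().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Pieces drawn line by line across both columns, i.e. not in reading order.
        List<Piece> pieces = Arrays.asList(
                new Piece("A1", 50, 100), new Piece("B1", 300, 100),
                new Piece("A2", 50, 120), new Piece("B2", 300, 120));

        // Column rectangles you would have to determine by your own analysis.
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("left", new Rectangle2D.Double(0, 0, 200, 800));
        regions.put("right", new Rectangle2D.Double(200, 0, 400, 800));

        Map<String, String> text = textByRegion(pieces, regions);
        System.out.println(text.get("left"));  // A1 A2
        System.out.println(text.get("right")); // B1 B2
    }
}
```

Because sorting happens inside each rectangle separately, the result is column-wise even when the content stream is not in reading order.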