extract PDF text by columns

时光取名叫无心 2021-01-14 19:41

My question is:

How can I extract text from a PDF file that is laid out in columns, so that the result is separated by these columns?

Background: I wo

2 Answers

无人共我 (original poster)

2021-01-14 19:56

If I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are put into lines without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does make this division.

[...] How is PDFTextStripper calculating column breaks?

It isn't.

By setting SortByPosition to false you tell PDFBox to not try to sort the text pieces from the page content stream but to instead accept them in the order they appear.

In your document the text pieces seem to be drawn in the reading order, i.e. column by column. This is not true for all documents, and to cope with other documents PDFBox offers the option of sorting the text pieces left-to-right, top-to-bottom.

Activating that option (setting SortByPosition to true) in your document returns the text without respect to the columns.
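To make the difference concrete, here is a simplified model in plain Java. It does not use PDFBox, and the `Piece` class and the sort rule are assumptions made for illustration; PDFBox's real comparator is more involved. It only shows why a top-to-bottom, left-to-right sort interleaves two columns, while keeping the content stream order preserves them when the document was drawn column by column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderDemo {
    /** Hypothetical stand-in for a positioned text piece; this is not a PDFBox class. */
    static final class Piece {
        final String text;
        final double x; // left edge of the piece
        final double y; // distance from the top of the page

        Piece(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Two columns with two lines each, drawn column by column. */
    static List<Piece> contentStreamOrder() {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece("L1", 50, 100));  // left column, line 1
        pieces.add(new Piece("L2", 50, 120));  // left column, line 2
        pieces.add(new Piece("R1", 300, 100)); // right column, line 1
        pieces.add(new Piece("R2", 300, 120)); // right column, line 2
        return pieces;
    }

    /** Models SortByPosition = false: keep the order in which pieces were drawn. */
    static List<String> inStreamOrder(List<Piece> pieces) {
        List<String> out = new ArrayList<>();
        for (Piece p : pieces) {
            out.add(p.text);
        }
        return out;
    }

    /** Models SortByPosition = true: top-to-bottom, then left-to-right. */
    static List<String> sortedByPosition(List<Piece> pieces) {
        List<Piece> copy = new ArrayList<>(pieces);
        copy.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return inStreamOrder(copy);
    }

    public static void main(String[] args) {
        List<Piece> pieces = contentStreamOrder();
        System.out.println(inStreamOrder(pieces));    // [L1, L2, R1, R2]
        System.out.println(sortedByPosition(pieces)); // [L1, R1, L2, R2]
    }
}
```

In the sorted output the first line of the right column lands directly after the first line of the left column, which is the "all in one line" effect described in the question.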

Are there methods in the PDFBox API to catch this / to extract the text by columns?

PDFBox does not analyze the page content to recognize columns. If you do the analysis yourself, though, it allows you to extract text column by column, provided you supply the column rectangles as regions.
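As a rough sketch of the "do the analysis yourself" part, the helper below builds the column rectangles for the simplest case. It assumes equally wide, full-height columns and a known column count, which will not hold for every document; the class name `ColumnRegions` and the method `equalColumns` are made up for this example. It uses only the JDK, so it runs without PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class ColumnRegions {
    /**
     * Splits a page into equally wide, full-height column rectangles.
     * Equal widths are an assumption; real layouts may need measured bounds.
     */
    static List<Rectangle2D> equalColumns(double pageWidth, double pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        double columnWidth = pageWidth / columns;
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            // x, y, width, height of column i
            regions.add(new Rectangle2D.Double(i * columnWidth, 0, columnWidth, pageHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // 612 x 792 points is US Letter; two columns are assumed for illustration.
        List<Rectangle2D> regions = equalColumns(612, 792, 2);
        for (int i = 0; i < regions.size(); i++) {
            System.out.println("column" + i + " -> " + regions.get(i));
        }
    }
}
```

With PDFBox on the classpath, each rectangle would then be registered on a `PDFTextStripperByArea` via `addRegion(name, rect)`, followed by `extractRegions(page)` and one `getTextForRegion(name)` call per column. Check the coordinate origin against your own document before relying on the rectangles.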
