PDF find out if text is underlined or a table cell

遇见更好的自我 2020-12-30 18:45

I have been playing around with PDFBox and its PDFTextStripperByArea class.

I was able to extract information if the text is bold or italic, b

5 answers
  • 2020-12-30 19:16

    Here is what I have found out so far:

    PDFBox uses a resource file to bind PDF operators/instructions to certain classes, which then process the information.

    If we take a look at the PDFTextStripper.properties resource file under:

    pdfbox\src\main\resources\org\apache\pdfbox\resources\

    we can see that for instance the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class and so on.
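As a rough illustration of that binding mechanism, the sketch below loads a properties-style operator map and looks up the handler for an operator. The class `OperatorMapSketch` and its methods are hypothetical, and the `ET` entry is assumed by analogy with the `BT` binding quoted above; it is not copied from the actual resource file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class OperatorMapSketch {
    // Illustrative entries only. The BT line mirrors the binding described above;
    // the ET line is an assumed analogue, not a quote from the real file.
    static final String SAMPLE =
        "BT=org.apache.pdfbox.util.operator.BeginText\n" +
        "ET=org.apache.pdfbox.util.operator.EndText\n";

    /** Parses "operator=handler class" lines into a Properties map. */
    static Properties load(String text) {
        Properties map = new Properties();
        try {
            map.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    /** Returns the handler class name bound to an operator, or null if it is unbound. */
    static String handlerFor(Properties map, String operator) {
        return map.getProperty(operator);
    }

    public static void main(String[] args) {
        Properties map = load(SAMPLE);
        System.out.println("BT -> " + handlerFor(map, "BT"));
        // An operator with no entry has no handler, so a processor built on this map skips it.
        System.out.println("re -> " + handlerFor(map, "re"));
    }
}
```

An operator that has no entry in the map simply has no handler, which is why a text-only mapping never sees the drawing instructions.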

    The PDFTextStripper under

    pdfbox\src\main\java\org\apache\pdfbox\util\

    takes this into account and uses these classes to process the PDF.

    BUT all graphical objects are ignored, so there is no information about underlines or table structure!

    Now if we take a look at the PageDrawer.properties resource file, we can see that this one binds almost all available operators. It is used by the PageDrawer class under

    pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\

    The "trick" is now to find out which graphical operators draw underlines and tables, and to use them in combination with PDFTextStripper.
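For the table half of the problem, one possible approach, once ruling lines have been collected, is to treat them as a grid and ask which cell a piece of text falls into. The sketch below assumes a simple, fully ruled rectangular table; `TableCellLookup` and its inputs are hypothetical, and real tables with merged cells or missing borders would need more work.

```java
import java.util.Arrays;

public class TableCellLookup {
    /**
     * Returns {row, col} of the grid cell containing (x, y), or null if the point
     * lies outside the grid. Rows are counted from the top, since PDF y grows upward.
     */
    static int[] cellOf(double[] verticalXs, double[] horizontalYs, double x, double y) {
        double[] xs = verticalXs.clone();
        double[] ys = horizontalYs.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        int col = interval(xs, x);
        int rowFromBottom = interval(ys, y);
        if (col < 0 || rowFromBottom < 0) {
            return null;
        }
        int row = (ys.length - 2) - rowFromBottom;
        return new int[] {row, col};
    }

    /** Index i such that sorted[i] <= v < sorted[i + 1], or -1 if there is none. */
    static int interval(double[] sorted, double v) {
        for (int i = 0; i + 1 < sorted.length; i++) {
            if (v >= sorted[i] && v < sorted[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] xs = {72, 200, 328};   // x positions of vertical rulings
        double[] ys = {600, 650, 700};  // y positions of horizontal rulings
        System.out.println(Arrays.toString(cellOf(xs, ys, 100, 680)));
        System.out.println(Arrays.toString(cellOf(xs, ys, 10, 610)));
    }
}
```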

    Now this would mean reading the PDF file specification, which is currently way too much work.

    If someone knows which operators are responsible for which actions to draw underlines and table lines please let me know.

  • 2020-12-30 19:21

    You can use iText to generate PDF reports.

    With iText you can add lines easily.

    Try the following:

    document.add(new LineSeparator(0.5f, 50, null, 0, 198));

    The code above draws a line in the PDF report; set the dimensions to whatever you need.

    Hope this helps.

  • 2020-12-30 19:28

    As you mention -- PDFBox uses resource files, to bind PDF operators/ instructions to visitors which will process the information.

    You'd probably best start by copying PDFBox's existing visitor into your own source-folder, and then adding/ extending the implementation from there.

    My long-ago PostScript experience recalls 'moveto' and 'lineto' operators. Since PDF is roughly PS-based, you'll be looking for something similar.

    http://learnpostscript.wordpress.com/category/lineto/
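In PDF content streams the corresponding path operators are `m` (move to), `l` (line to) and `re` (rectangle). The sketch below scans a content-stream fragment for them and collects the shapes. It is deliberately simplified: it splits on whitespace and assumes the fragment contains only numbers and operators (no strings, arrays, dictionaries or comments), so a real implementation would rely on a proper PDF parser instead. `PathOperatorScan` and `Shape` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PathOperatorScan {
    /** A line segment or a rectangle given by two opposite corners, in user-space units. */
    static final class Shape {
        final String kind; // "line" or "rect"
        final double x1, y1, x2, y2;

        Shape(String kind, double x1, double y1, double x2, double y2) {
            this.kind = kind;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    static List<Shape> scan(String content) {
        List<Shape> shapes = new ArrayList<>();
        Deque<Double> operands = new ArrayDeque<>();
        double curX = 0, curY = 0;
        for (String tok : content.trim().split("\\s+")) {
            try {
                operands.addLast(Double.parseDouble(tok)); // numeric operand
                continue;
            } catch (NumberFormatException e) {
                // not a number, so treat the token as an operator
            }
            if (tok.equals("m") && operands.size() >= 2) {
                curY = operands.pollLast();
                curX = operands.pollLast();
            } else if (tok.equals("l") && operands.size() >= 2) {
                double y = operands.pollLast();
                double x = operands.pollLast();
                shapes.add(new Shape("line", curX, curY, x, y));
                curX = x;
                curY = y;
            } else if (tok.equals("re") && operands.size() >= 4) {
                double h = operands.pollLast();
                double w = operands.pollLast();
                double y = operands.pollLast();
                double x = operands.pollLast();
                shapes.add(new Shape("rect", x, y, x + w, y + h));
            }
            operands.clear(); // operands belong to the operator just seen
        }
        return shapes;
    }

    public static void main(String[] args) {
        // A stroked line, then a thin filled rectangle: two common ways to draw an underline.
        for (Shape s : scan("72 700 m 200 700 l S 72 690 128 0.5 re f")) {
            System.out.println(s.kind + " " + s.x1 + "," + s.y1 + " -> " + s.x2 + "," + s.y2);
        }
    }
}
```

Note that the sketch ignores the painting operators (`S`, `f` and so on) and the current transformation matrix, both of which matter when deciding whether a path is actually visible and where it ends up on the page.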

    PDF format is a b*tch -- it's HTML, done wrong. It represents graphical implementation, not semantics. Even reconstructing sentences is difficult -- words or even individual characters are positioned, and the 'space' or 'newline' must be algorithmically reconstructed. In short, Adobe are a*holes. And Reader is a non-ergonomic, bug-riddled, insecure, bloated pig.

    However, you can accomplish your requirement -- if you are willing to put, say, 12+ hours of work in. As well as detecting by position, underlines will typically be emitted in the PDF immediately after their text, so you can latch your detection by PDF document-order, not just page position.
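The position-based part of that detection might look roughly like the following. This is a heuristic sketch, not anything PDFBox provides: `UnderlineHeuristic` is a hypothetical helper, and the three thresholds (80% horizontal overlap, at most a quarter of the font size below the baseline, thickness at most 15% of the font size) are assumptions that would need tuning against real documents.

```java
public class UnderlineHeuristic {
    /**
     * Decides whether a horizontal line looks like an underline for a run of text.
     * Coordinates are in PDF user space, where y grows upward.
     */
    static boolean isUnderline(double textX, double baselineY, double textWidth, double fontSize,
                               double lineX1, double lineX2, double lineY, double thickness) {
        // Horizontal overlap between the text run and the line.
        double left = Math.max(textX, Math.min(lineX1, lineX2));
        double right = Math.min(textX + textWidth, Math.max(lineX1, lineX2));
        double overlap = Math.max(0, right - left);

        // Positive when the line sits below the baseline.
        double drop = baselineY - lineY;

        return textWidth > 0
            && overlap >= 0.8 * textWidth      // assumed: line spans most of the text
            && drop >= 0
            && drop <= 0.25 * fontSize         // assumed: line is just under the baseline
            && thickness <= 0.15 * fontSize;   // assumed: line is thin relative to the font
    }

    public static void main(String[] args) {
        // 12pt text starting at x=72 on baseline y=700, 128 units wide.
        System.out.println(isUnderline(72, 700, 128, 12, 72, 200, 698.5, 0.5));
        System.out.println(isUnderline(72, 700, 128, 12, 72, 200, 650, 0.5));
    }
}
```

Combining this with document order, as suggested above, would mean only testing lines that are drawn shortly after the text run in the content stream.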

    Also, try constructing a trivial two-line PDF with underlined text. Then see what you can make of it, parsing it back in! The underline should stick out like dog's bananas, and once you can detect that, you'll be well on the way.

    PDFBox is not very good for extensibility, it's mainly just a big pile of algorithms. For this reason, just copy the PDFTextStripper source (and maybe have PageDrawer for reference) and prototype from there.

    Hope this helps!

  • 2020-12-30 19:38

    According to the API, getfont() returns the font size.

    You can use getStyle() method and it will return STYLE_UNDERLINE for underlined font. Thus you can retrieve underline style.

  • 2020-12-30 19:39

    As far as I understand PDFBox, it has no option for reading underlines. Maybe you can try itextpdf for this purpose.
