iTextSharp - How to get the position of word on a page

别那么骄傲 2020-11-29 07:58

I am using iTextSharp and the reader.GetPageContent method to pull the text out of a PDF. I need to find the rectangle/position for each word found in the document. Is the

1 answer
  • 2020-11-29 08:12

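    Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick on its own, since it hands back the text as a plain string without positions. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor. (The snippets here are Java, from iText; as far as I know iTextSharp mirrors the same classes with PascalCase method names, e.g. RenderText and GetTextFromPage.)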

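    // import com.itextpdf.text.pdf.parser.*;

    MyTexExStrat strat = new MyTexExStrat();
    PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
    // get the strings-n-rects from strat.

    public class MyTexExStrat implements TextExtractionStrategy {
        // Methods that implement an interface must be declared public.
        public void beginTextBlock() {}
        public void endTextBlock() {}
        public void renderImage(ImageRenderInfo info) {}
        public void renderText(TextRenderInfo info) {
            // track text and location here.
        }
        // Also required by TextExtractionStrategy; return the text you collected.
        public String getResultantText() { return ""; }
    }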
    

    You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
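
The baseline-grouping idea can be sketched without any iText types. What follows is a minimal sketch under stated assumptions, not the actual LTES code: `Chunk`, `WordGrouper`, `lineKey`, `groupIntoWords`, and both tolerance parameters are hypothetical names made up for illustration. It assumes horizontal, left-to-right text, and that each chunk is either whitespace or a whitespace-free fragment. In a real strategy you would fill the chunk list from renderText.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for one piece of text and its box; not an iText type. */
class Chunk {
    final String text;
    final float left, bottom, right, top;

    Chunk(String text, float left, float bottom, float right, float top) {
        this.text = text;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

class WordGrouper {
    /** Buckets a chunk's bottom edge so nearly equal baselines compare as equal. */
    static long lineKey(Chunk c, float baselineTolerance) {
        return Math.round((double) c.bottom / baselineTolerance);
    }

    /** Merges chunks on the same line whose horizontal gap is at most maxGap. */
    static List<Chunk> groupIntoWords(List<Chunk> chunks, float baselineTolerance, float maxGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Higher lines first (PDF y grows upward), then left to right within a line.
        sorted.sort(Comparator
                .comparingLong((Chunk c) -> -lineKey(c, baselineTolerance))
                .thenComparingDouble(c -> c.left));

        List<Chunk> words = new ArrayList<>();
        Chunk current = null;
        long currentLine = 0;
        for (Chunk c : sorted) {
            boolean blank = c.text.trim().isEmpty();
            long line = lineKey(c, baselineTolerance);
            boolean joins = current != null && !blank
                    && line == currentLine
                    && c.left - current.right <= maxGap;
            if (joins) {
                // Extend the current word's box to cover the new chunk.
                current = new Chunk(current.text + c.text,
                        current.left,
                        Math.min(current.bottom, c.bottom),
                        Math.max(current.right, c.right),
                        Math.max(current.top, c.top));
            } else {
                if (current != null) words.add(current);
                // A whitespace chunk ends the word but is not kept itself.
                current = blank ? null : c;
                currentLine = line;
            }
        }
        if (current != null) words.add(current);
        return words;
    }
}
```

Each returned `Chunk` then carries one word and the rectangle that encloses it. The bucketing in `lineKey` is a simplification: two baselines that sit on either side of a bucket boundary will land on different lines, so pick the tolerance with your fonts in mind.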

    PS: to build the rects, you can just get the AscentLine & DescentLine and use those coordinates as the top and bottom corners:

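    // info is the TextRenderInfo passed to renderText().
    // Rectangle is com.itextpdf.text.Rectangle; Vector is com.itextpdf.text.pdf.parser.Vector.
    Vector bottomLeft = info.getDescentLine().getStartPoint();
    Vector topRight = info.getAscentLine().getEndPoint();
    // Vector.I1 is the x component, Vector.I2 the y component.
    Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                                   bottomLeft.get(Vector.I2),
                                   topRight.get(Vector.I1),
                                   topRight.get(Vector.I2));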
    

    Warning: The above code assumes that the text is horizontal and proceeds from left to right. Rotated text will break it, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
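
If rotated text does matter, one way around it is to take all four corners (the start and end points of both the ascent line and the descent line) and enclose them with min/max, rather than trusting two of them to be bottom-left and top-right. This is a rough sketch in plain Java, not iText API: `BoundingBox` and `enclose` are hypothetical names, and the corners are passed as {x, y} pairs that you would read off the two lines yourself.

```java
class BoundingBox {
    /** Returns {left, bottom, right, top} of the smallest upright box around the given {x, y} points. */
    static float[] enclose(float[][] corners) {
        float left = Float.POSITIVE_INFINITY;
        float bottom = Float.POSITIVE_INFINITY;
        float right = Float.NEGATIVE_INFINITY;
        float top = Float.NEGATIVE_INFINITY;
        for (float[] p : corners) {
            left = Math.min(left, p[0]);
            bottom = Math.min(bottom, p[1]);
            right = Math.max(right, p[0]);
            top = Math.max(top, p[1]);
        }
        return new float[] {left, bottom, right, top};
    }
}
```

For text running straight up the page, the descent line might go from (10, 0) to (10, 50) and the ascent line from (0, 0) to (0, 50). Using descent start and ascent end as the two corners would put the "left" edge at x = 10 and the "right" edge at x = 0, whereas `enclose` returns left 0, bottom 0, right 10, top 50. The result is an upright box around the rotated text, not a rotated rectangle.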

    Good hunting.
