Text coordinates when stripping from PDFBox

轻奢々 2020-12-06 20:36

I'm trying to extract text with coordinates from a PDF file using PDFBox.

I mixed some methods/info found on the internet (Stack Overflow too), but the problem I have th

2 answers
  • 2020-12-06 21:38

    This is just another case of the excessive PDFTextStripper coordinate normalization. Like you, I had thought that using TextPosition.getTextMatrix() (instead of getX() and getY()) would give the actual coordinates, but no: even these matrix values have to be corrected (at least in PDFBox 2.0.x; I haven't checked 1.8.x), because the matrix is multiplied by a translation that makes the lower left corner of the crop box the origin.

    Thus, in your case (in which the lower left of the crop box is not the origin), you have to correct the values, e.g. by replacing

            float x = minx;
            float y = firstPosition.getTextMatrix().getTranslateY();
    

    by

            PDRectangle cropBox = doc.getPage(0).getCropBox();
    
            float x = minx + cropBox.getLowerLeftX();
            float y = firstPosition.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY();
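
The correction above is plain vector addition, so it can be tried out without PDFBox. The following is only a minimal sketch: the class name, the helper `toPageCoordinates` and the crop box corner (36, 50) are made up for illustration and are not part of PDFBox or of the question's code.

```java
/** Sketch: shifting crop-box-relative text coordinates back into page coordinates. */
final class CropBoxOffsetSketch {

    /** Adds the crop box's lower-left corner to a crop-box-relative point. */
    static float[] toPageCoordinates(float x, float y, float cropLowerLeftX, float cropLowerLeftY) {
        return new float[] { x + cropLowerLeftX, y + cropLowerLeftY };
    }

    public static void main(String[] args) {
        // Hypothetical crop box whose lower-left corner is at (36, 50).
        float[] p = toPageCoordinates(100f, 700f, 36f, 50f);
        System.out.println(p[0] + ", " + p[1]); // prints "136.0, 750.0"
    }
}
```

When the crop box starts at the origin, both offsets are zero and the values are unchanged, which is presumably why the problem only shows up for some documents.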
    

    Instead of [image: result before the correction] you now get [image: result after the correction].

    Obviously, though, you will also have to correct the height somewhat. This is due to the way the PDFTextStripper determines the text height:

        // 1/2 the bbox is used as the height todo: why?
        float glyphHeight = bbox.getHeight() / 2;
    

    (from showGlyph(...) in LegacyPDFStreamEngine, the parent class of PDFTextStripper)

    While the font bounding box indeed usually is too large, half of it often is not enough.
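
To get a feel for how far off the half-bounding-box height can be, here is a small sketch that compares it with the ascent-to-descent height. It assumes glyph metrics in 1/1000 text units and a plain scale by font size; the class and method names are hypothetical, and the metrics (bounding box from -225 to 931, ascent 718, descent -207) are illustrative Helvetica-like values, not numbers read from any particular PDF.

```java
/** Sketch: half-bounding-box height versus ascent-to-descent height. */
final class GlyphHeightSketch {

    /** Half the font bounding box height, scaled from 1/1000 glyph units by the font size. */
    static float halfBBoxHeight(float bboxLowerY, float bboxUpperY, float fontSize) {
        return (bboxUpperY - bboxLowerY) / 2 / 1000 * fontSize;
    }

    /** Distance from ascent to descent (descent is negative), scaled the same way. */
    static float ascentDescentHeight(float ascent, float descent, float fontSize) {
        return (ascent - descent) / 1000 * fontSize;
    }

    public static void main(String[] args) {
        float fontSize = 12f;
        // Illustrative metrics only (assumed, Helvetica-like).
        System.out.println(halfBBoxHeight(-225f, 931f, fontSize));      // roughly 6.94
        System.out.println(ascentDescentHeight(718f, -207f, fontSize)); // roughly 11.1
    }
}
```

With these assumed numbers the halved bounding box is shorter than even the ascent alone (about 8.6 at 12 pt), which matches the observation that half of the bounding box often is not enough.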

  • 2020-12-06 21:38

    The following code worked for me:

        // Definition of font baseline, ascent, descent: https://en.wikipedia.org/wiki/Ascender_(typography)
        //
        // The origin of the text coordinate system is the top-left corner where Y increases downward.
        // TextPosition.getX(), getY() return the baseline.
        TextPosition firstLetter = textPositions.get(0);
        TextPosition lastLetter = textPositions.get(textPositions.size() - 1);
    
        // Looking at LegacyPDFStreamEngine.showGlyph(), ascender and descender heights are calculated like
        // CapHeight: https://stackoverflow.com/a/42021225/14731
        // Scale the first letter's font metrics by the first letter's own font size.
        float ascent = firstLetter.getFont().getFontDescriptor().getAscent() / 1000 * firstLetter.getFontSize();
        // Point is assumed to be a float-based point type (java.awt.Point only accepts ints).
        Point topLeft = new Point(firstLetter.getX(), firstLetter.getY() - ascent);
    
        float descent = lastLetter.getFont().getFontDescriptor().getDescent() / 1000 * lastLetter.getFontSize();
        // Descent is negative, so subtracting it moves the point downward.
        Point bottomRight = new Point(lastLetter.getX() + lastLetter.getWidth(),
            lastLetter.getY() - descent);
    

    In other words, we are creating a bounding box from the font's ascender down to its descender.
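
The arithmetic behind that bounding box can be checked in isolation. This is a rough sketch with plain floats instead of TextPosition objects; the class name, the `bounds` helper and all the sample numbers (ascent 718, descent -207, font size 10, baseline at y = 100) are assumptions chosen for illustration.

```java
/** Sketch: bounding box from ascent down to descent, top-left origin, Y growing downward. */
final class TextBoundsSketch {

    /** Returns {left, top, right, bottom} for a run of text on one baseline. */
    static float[] bounds(float firstX, float baselineY, float lastX, float lastWidth,
                          float ascent, float descent, float fontSize) {
        float ascentHeight = ascent / 1000 * fontSize;   // positive: extent above the baseline
        float descentHeight = descent / 1000 * fontSize; // negative: extent below the baseline
        return new float[] {
            firstX,
            baselineY - ascentHeight,  // smaller Y is higher on the page
            lastX + lastWidth,
            baselineY - descentHeight  // subtracting a negative moves downward
        };
    }

    public static void main(String[] args) {
        float[] box = bounds(50f, 100f, 80f, 6f, 718f, -207f, 10f);
        // Approximately left=50, top=92.82, right=86, bottom=102.07
        System.out.println(java.util.Arrays.toString(box));
    }
}
```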

    If you want to render these coordinates with the origin in the bottom-left corner, see https://stackoverflow.com/a/28114320/14731 for more details. You'll need to apply a transform like this:

    contents.transform(new Matrix(1, 0, 0, -1, 0, page.getHeight()));
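
To see what this flip does to a single point, the same six values can be fed to the JDK's `java.awt.geom.AffineTransform`. This is a sketch under the assumption that the Matrix arguments follow the usual PDF (a, b, c, d, e, f) order; the class name, the `flip` helper and the page height of 792 are made up for illustration.

```java
/** Sketch: mapping a top-left-origin point to a bottom-left-origin point. */
final class FlipYSketch {

    /** Applies x' = x, y' = pageHeight - y using the matrix (1, 0, 0, -1, 0, pageHeight). */
    static java.awt.geom.Point2D flip(double x, double y, double pageHeight) {
        java.awt.geom.AffineTransform transform =
            new java.awt.geom.AffineTransform(1, 0, 0, -1, 0, pageHeight);
        return transform.transform(new java.awt.geom.Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Hypothetical page height of 792 units (US Letter).
        java.awt.geom.Point2D p = flip(72, 100, 792);
        System.out.println(p.getX() + ", " + p.getY()); // prints "72.0, 692.0"
    }
}
```

A point 100 units below the top edge ends up 692 units above the bottom edge, while X is left alone.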
    