How to change the coordinates of text on a PDF page from a lower-left origin to an upper-left origin

既然无缘 2021-01-25 15:03

I am using the PDFBox and iTextSharp DLLs to process a PDF, so that I can get the coordinates of the text within a rectangle. The rectangle coordinates are extracted using the i

2 Answers
  • 2021-01-25 15:20
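    // Assuming an iTextSharp Rectangle, where Height = Top - Bottom:
    // mediabox.Top - mediabox.Height is the bottom edge of the media box,
    // so a nonzero value means the box does not start at y = 0.
    if ((mediabox.Top - mediabox.Height) != 0)
    {
        topY = mediabox.Top;
        heightY = mediabox.Height;
        diffY = topY - heightY;   // bottom edge of the media box
        // Flipping swaps the edges: the old upper edge (ury) gives the new lower value.
        lly_adjust = (topY - ury) + diffY;
        ury_adjust = (topY - lly) + diffY;
    }
    // Otherwise, check whether the crop box has a nonzero bottom edge.
    else if ((cropbox.Top - cropbox.Height) != 0)
    {
        topY = mediabox.Top;
        heightY = cropbox.Top;
        diffY = topY - heightY;   // gap between the media box top and the crop box top
        lly_adjust = (topY - ury) - diffY;   // equals cropbox.Top - ury
        ury_adjust = (topY - lly) - diffY;   // equals cropbox.Top - lly
    }
    // Both boxes start at y = 0: measure straight down from the media box top.
    else
    {
        lly_adjust = mediabox.Top - ury;
        ury_adjust = mediabox.Top - lly;
    }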

    These are the final adjustments I made.
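
    For readers not on .NET, the simplest case (the last branch above, where the page starts at y = 0) can be sketched in plain Java without any PDF library. `RectYFlip`, `YRange` and `flip` are hypothetical names used only for illustration. The point it shows is that the rectangle's edges trade places when the axis is flipped.

```java
// Hypothetical helper for illustration; not part of iTextSharp or PDFBox.
final class RectYFlip {
    /** Vertical extent of a rectangle, measured downward from the top of the page. */
    static final class YRange {
        final double topEdge;    // distance from the page top to the rectangle's top edge
        final double bottomEdge; // distance from the page top to the rectangle's bottom edge

        YRange(double topEdge, double bottomEdge) {
            this.topEdge = topEdge;
            this.bottomEdge = bottomEdge;
        }
    }

    /**
     * Converts PDF-style edges (lly below ury, both measured upward) into
     * distances measured downward from pageTop. The old upper edge (ury)
     * produces the smaller of the two new values.
     */
    static YRange flip(double lly, double ury, double pageTop) {
        return new YRange(pageTop - ury, pageTop - lly);
    }
}
```

    This assumes the rectangle and `pageTop` are expressed in the same coordinate system, which is exactly what the first two branches above try to correct for.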

  • 2021-01-25 15:32

    The coordinate system in PDF is defined in ISO-32000-1. This ISO standard explains that the X-axis is oriented towards the right, whereas the Y-axis has an upward orientation. This is the default. These are the coordinates that are returned by iText (behind the scenes, iText resolves all CTM transformations).

    If you want to transform the coordinates returned by iText so that you get coordinates in a coordinate system where the Y axis has a downward orientation, you could for instance subtract the Y value returned by iText from the Y-coordinate of the top of the page.

    An example: Suppose that we are dealing with an A4 page, where the Y coordinate of the bottom is 0 and the Y coordinate of the top is 842. If you have Y coordinates such as y1 = 806 and y2 = 36, then you can do this:

    y = 842 - y;
    

    Now y1 = 36 and y2 = 806. You have just reversed the orientation of the Y-axis using nothing more than simple high-school math.
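
    As a minimal sketch, here is the same arithmetic in plain Java. `YAxisFlip` and `flipY` are made-up names, and 842 is simply the A4 height from the example. Applying the flip twice returns the original value, so the same function converts in both directions.

```java
// Hypothetical helper for illustration; it uses no PDF library.
final class YAxisFlip {
    /** Turns a distance from the bottom of the page into a distance from the top, and vice versa. */
    static double flipY(double y, double pageTop) {
        return pageTop - y;
    }

    public static void main(String[] args) {
        double pageTop = 842; // A4 height in points, as in the example above

        System.out.println(flipY(806, pageTop)); // prints 36.0
        System.out.println(flipY(36, pageTop));  // prints 806.0

        // Flipping twice restores the original PDF coordinate.
        System.out.println(flipY(flipY(806, pageTop), pageTop)); // prints 806.0
    }
}
```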

    Update based on an extra comment:

    Each page has a media box. This defines the most important page boundaries. Other page boundaries may be present, but none of them shall exceed the media box (if they do, then your PDF is in violation of ISO-32000-1).

    The crop box defines the visible area of the page. By default (for instance if a crop box entry is missing), the crop box coincides with the media box.
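
    If what you want is a position relative to the visible area, the fallback rule could be sketched like this. This is plain Java with boxes stored as `{llx, lly, urx, ury}` arrays; `PageBoxes` and its methods are hypothetical and only illustrate the default, they are not an iText or PdfBox API.

```java
// Hypothetical helper for illustration; boxes are {llx, lly, urx, ury} in points.
final class PageBoxes {
    /** Returns the crop box, or the media box when no crop box is given (null). */
    static double[] effectiveCropBox(double[] mediaBox, double[] cropBox) {
        return cropBox != null ? cropBox : mediaBox;
    }

    /** Distance of y below the top edge of the visible area. */
    static double yFromVisibleTop(double y, double[] mediaBox, double[] cropBox) {
        double top = effectiveCropBox(mediaBox, cropBox)[3]; // ury of the visible area
        return top - y;
    }
}
```

    With a media box of `{0, 0, 595, 842}` and no crop box, y = 806 maps to 36. With a crop box of `{0, 50, 595, 792}`, y = 700 maps to 92.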

    In your comment, you say that you subtract llx from the height. This is incorrect. llx is the lower-left x coordinate, whereas the height is a property measured on the Y axis, unless the page is rotated. Did you check if the page dictionary has a /Rotate value?
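
    To illustrate why rotation matters here: the standard requires /Rotate to be a multiple of 90, and for 90 or 270 degrees the displayed width and height trade places. A rough sketch in plain Java, where `RotatedSize` and `displayedSize` are hypothetical names and no PDF library is involved:

```java
// Hypothetical helper for illustration; it does not read a real page dictionary.
final class RotatedSize {
    /**
     * Returns {width, height} as the page is displayed, given the unrotated
     * box size and the /Rotate value in degrees.
     */
    static double[] displayedSize(double width, double height, int rotate) {
        if (rotate % 90 != 0) {
            throw new IllegalArgumentException("/Rotate must be a multiple of 90");
        }
        // Normalize negative or large values into 0, 90, 180 or 270.
        int normalized = ((rotate % 360) + 360) % 360;
        boolean swapped = normalized == 90 || normalized == 270;
        return swapped ? new double[] {height, width} : new double[] {width, height};
    }
}
```

    This only covers the page dimensions. Mapping individual coordinates on a rotated page needs the full rotation applied to each point.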

    You also claim that the values returned by iText do not match the values returned by PdfBox. Note that the values returned by iText conform to the coordinate system as defined by the ISO standard. If PdfBox doesn't follow this standard, you should ask the people from PdfBox why they didn't follow the standard, and what coordinate system they are using instead.

    Maybe that's what mkl's comment is about. He wrote:

    Y' = Ymax - Y. X' = X - Xmin.

    Maybe PdfBox searches for the maximum Y value Ymax and the minimum X value Xmin and then applies the above transformation on all coordinates. This is a useful transformation if you want to render a PDF, but it's unwise to perform such an operation if you want to use the coordinates, for instance to add content at specific positions relative to text on the page (because the transformed coordinates are no longer "PDF" coordinates).
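
    If you do end up with coordinates transformed this way, you can get back to PDF coordinates as long as you know Xmin and Ymax. A sketch in plain Java of mkl's formulas and their inverse follows. `RenderTransform` and its methods are hypothetical names, and whether PdfBox really applies this transformation is a guess, as said above.

```java
// Hypothetical helper for illustration; points are returned as {x, y}.
final class RenderTransform {
    /** Applies X' = X - Xmin and Y' = Ymax - Y (origin moves to the upper left). */
    static double[] toTopLeft(double x, double y, double xMin, double yMax) {
        return new double[] {x - xMin, yMax - y};
    }

    /** Inverse: X = X' + Xmin and Y = Ymax - Y' (back to PDF coordinates). */
    static double[] toPdf(double xPrime, double yPrime, double xMin, double yMax) {
        return new double[] {xPrime + xMin, yMax - yPrime};
    }
}
```

    For example, with Xmin = 20 and Ymax = 842, the PDF point (100, 700) becomes (80, 142), and `toPdf(80, 142, 20, 842)` gives (100, 700) again.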

    Remark:

    You say you need PdfBox to get the text of a page. Why do you need this extra tool? iText is perfectly capable of extracting and reordering the text on a page (assuming that you use the correct extraction strategy). If not, please clarify.

    • Note that we recently decided to support Type3 fonts, although we weren't convinced that this makes sense (see the question "Text extraction is empty and unknown for text has type3 font using PDFBox,iText (difficult topic!)" to understand why not).
    • What some consider "wrong extraction" can often be "wrong interpretation" of what is extracted as explained in this mailing-list answer: http://thread.gmane.org/gmane.comp.java.lib.itext.general/66829/focus=66830
    • There are other cases where we follow the spec, leading to results that are different than what PdfBox returns. Watch https://www.youtube.com/watch?v=wxGEEv7ibHE for more info.