I am using PDFBOX and itextsharp dll and processing a pdf. so that I get the text coordinates of the text within a rectangle. the rectangle coordinates are extracted using the i
if ((mediabox.Top - mediabox.Height) != 0)
{
topY = mediabox.Top;
heightY = mediabox.Height;
diffY = topY - heightY;
lly_adjust = (topY - ury) + diffY;
ury_adjust = (topY - lly) + diffY;
}
else if ((cropbox.Top - cropbox.Height) != 0)
{
topY = mediabox.Top;
heightY = cropbox.Top;
diffY = topY - heightY;
lly_adjust = (topY - ury) - diffY;
ury_adjust = (topY - lly) - diffY;
}
else
{
lly_adjust = mediabox.Top - ury;
ury_adjust = mediabox.Top - lly;
}
These are final adjustment done
The coordinate system in PDF is defined in ISO-32000-1. This ISO standard explains that the X-axis is oriented towards the right, whereas the Y-axis has an upward orientation. This is the default. These are the coordinates that are returned by iText (behind the scenes, iText resolves all CTM transformations).
If you want to transform the coordinates returned by iText so that you get coordinates in a coordinate system where the Y axis has a downward orientation, you could for instance subtract the Y value returned by iText from the Y-coordinate of the top of the page.
An example: Suppose that we are dealing with an A4 page, where the Y coordinate of the bottom is 0 and the Y coordinate of the top is 842. If you have Y coordinates such as y1 = 806
and y2 = 36
, then you can do this:
y = 842 - y;
Now y1 = 36
and y2 = 806
. You have just reversed the orientation of the Y-axis using nothing more than simple high-school math.
Update based on an extra comment:
Each page has a media box. This defines the most important page boundaries. Other page boundaries may be present, but none of them shall exceed the media box (if they do, then your PDF is in violation with ISO-32000-1).
The crop box defines the visible area of the page. By default (for instance if a crop box entry is missing), the crop box coincides with the media box.
In your comment, you say that you subtract llx from the height. This is incorrect. llx
is the lower-left x coordinate, whereas the height is a property measured on the Y axis, unless the page is rotated. Did you check if the page dictionary has a /Rotate
value?
You also claim that the values returned by iText do not match the values returned by PdfBox. Note that the values returned by iText conform with the coordinate system as defined by the ISO standard. If PdfBox doesn't follow this standard, you should ask the people from PdfBox why they didn't follow the standard, and what coordinate system they are using instead.
Maybe that's what mkl's comment is about. He wrote:
Y' = Ymax - Y. X' = X - Xmin.
Maybe PdfBox searches for the maximum Y value Ymax
and the minimum X value Xmin
and then applies the above transformation on all coordinates. This is a useful transformation if you want to render a PDF, but it's unwise to perform such an operation if you want to use the coordinates, for instance to add content at specific positions relative to text on the page (because the transformed coordinates are no longer "PDF" coordinates).
Remark:
You say you need PdfBox to get the text of a page. Why do you need this extra tool? iText is perfectly capable of extracting and reordering the text on a page (assuming that you use the correct extraction strategy). If not, please clarify.