I tried to read a stream and was hoping to get the exact position (coordinates) for each String.
int size = reader.getXrefSize();
for (int i = 0; i
If you want to understand what the bytes are you're seeing for the Tj operator, have a look at the PDF specification: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
More specifically, look at section 9.4.3. To paraphrase that section: each byte, or potentially each sequence of multiple bytes, must be looked up in the font used to paint the text (in your example the font is identified as /F1). That lookup gives you the actual character the code refers to.
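To illustrate the lookup described above, here is a minimal plain-Java sketch. It assumes a simple font with single-byte character codes, and the `codeToUnicode` table is a made-up stand-in for what you would really have to derive from the font's encoding data. The class and method names are hypothetical and not part of any library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: decodes the operand bytes of a Tj operator using a
// code-to-Unicode table that stands in for the font's real encoding data.
class SimpleFontDecoder {
    private final Map<Integer, String> codeToUnicode;

    SimpleFontDecoder(Map<Integer, String> codeToUnicode) {
        this.codeToUnicode = codeToUnicode;
    }

    // Assumes one byte per character code. That only holds for simple fonts;
    // composite fonts can use multi-byte codes.
    String decode(byte[] tjOperand) {
        StringBuilder sb = new StringBuilder();
        for (byte b : tjOperand) {
            int code = b & 0xFF; // treat the byte as unsigned
            String mapped = codeToUnicode.get(code);
            // Unknown codes become U+FFFD so that gaps stay visible.
            sb.append(mapped != null ? mapped : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "H"); // invented mapping, for illustration only
        table.put(2, "i");
        SimpleFontDecoder decoder = new SimpleFontDecoder(table);
        System.out.println(decoder.decode(new byte[] {1, 2}));
    }
}
```

The point of the sketch is only that the bytes in the content stream are font-specific codes, not text: the same byte can mean different characters under /F1 and under another font.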
Also keep in mind that the order in which you see these text commands might not reflect natural reading order at all, so you'll have to work out the correct order of the characters from the positions you find.
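As a rough sketch of what "ordering by position" can look like, the following plain-Java code groups text chunks into lines by their baseline y coordinate and then sorts each line left to right. The `Chunk` type, the tolerance parameter, and the single-column assumption are all mine; real pages with columns, tables, or rotated text need considerably more than this.

```java
import java.util.ArrayList;
import java.util.List;

class ReadingOrder {
    // Hypothetical value type: a piece of text and the start of its baseline.
    static class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups chunks into lines (baselines within yTolerance of the first
    // chunk of the line), top to bottom, and orders each line left to right.
    static String toText(List<Chunk> chunks, float yTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so the largest y comes first.
        sorted.sort((a, b) -> Float.compare(b.y, a.y));

        StringBuilder out = new StringBuilder();
        List<Chunk> line = new ArrayList<>();
        for (Chunk c : sorted) {
            if (!line.isEmpty() && Math.abs(line.get(0).y - c.y) > yTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(c);
        }
        appendLine(out, line);
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Chunk> line) {
        if (line.isEmpty()) {
            return;
        }
        line.sort((a, b) -> Float.compare(a.x, b.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (Chunk c : line) {
            out.append(c.text);
        }
    }
}
```

Grouping first and sorting within each group avoids a single comparator with a tolerance, which would not be a consistent ordering.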
Also keep in mind that your PDF file might not contain spaces, for example. Since a space can be "faked" by simply moving the next character a bit to the right, some PDF generators omit spaces. But a gap in coordinates is not necessarily a word break. It could also be the end of a column, for example.
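A simple gap heuristic might look like the sketch below. It assumes the chunks are already on the same line and in left-to-right order, and it treats any gap wider than half a space width as a word break. That threshold is an arbitrary assumption of mine, the names are hypothetical, and the code makes no attempt to tell a word gap from a column gap.

```java
class GapHeuristic {
    // True if the horizontal gap between the end of one chunk and the start
    // of the next is wide enough to be treated as a space.
    static boolean looksLikeSpace(float prevEndX, float nextStartX, float spaceWidth) {
        return (nextStartX - prevEndX) > spaceWidth * 0.5f;
    }

    // Joins chunks of one line, inserting a space wherever the gap suggests one.
    // startX[i] and endX[i] are the horizontal extent of texts[i].
    static String join(String[] texts, float[] startX, float[] endX, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0 && looksLikeSpace(endX[i - 1], startX[i], spaceWidth)) {
                sb.append(' ');
            }
            sb.append(texts[i]);
        }
        return sb.toString();
    }
}
```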
This is really, really hard, especially if you are trying to do this on generic PDF files (as opposed to only a few layouts that you know always come from the same source). I wrote a text editor for PDF long ago for a product called PitStop Pro that is still around (I'm no longer affiliated with it), and it was a really hard problem.
If that is an option, try using an existing library or tool. There are certainly commercial options for such a library or tool; I'm less familiar with open-source / free libraries so I can't comment on that.
As plinth and David van Driessche already pointed out in their answers, text extraction from PDF files is non-trivial. Fortunately the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but this class is essentially a convenience utility for using the parser functionality of iText if you're only interested in the plain text of the page. In your case you have to look at the classes in that package more closely.
A starting point to get information on the topic of text extraction with iText is section 15.3 Parsing PDFs of iText in Action, 2nd Edition, especially the method extractText of the sample ParsingHelloWorld.java:
public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}
which makes use of the RenderListener
implementation MyTextRenderListener.java:
public class MyTextRenderListener implements RenderListener
{
    [...]
    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}
While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers far more information:
public LineSegment getBaseline(); // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine(); // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
public float getRise() ; // the rise which represents how far above the nominal baseline the text should be rendered
public String getText(); // the text to render
public int getTextRenderMode(); // the text render mode
public DocumentFont getFont(); // the font
public float getSingleSpaceWidth(); // the width, in user space units, of a single space character in the current font
public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation
Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you will likely need.
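To show how those lines translate into a rectangle, here is a small library-independent sketch. In iText the four corner points would come from the start and end points of the descent line and the ascent line; the sketch takes them as plain `{x, y}` arrays so that it runs without iText. The `TextBox` class is hypothetical and not part of iText.

```java
// Hypothetical helper: an axis-aligned bounding box around a run of text.
class TextBox {
    final float left;
    final float bottom;
    final float right;
    final float top;

    private TextBox(float left, float bottom, float right, float top) {
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }

    // Each corner is {x, y}. Passing the start and end points of both the
    // descent line and the ascent line also covers rotated text, because the
    // box is the min/max over all four points.
    static TextBox around(float[]... corners) {
        float l = Float.POSITIVE_INFINITY;
        float b = Float.POSITIVE_INFINITY;
        float r = Float.NEGATIVE_INFINITY;
        float t = Float.NEGATIVE_INFINITY;
        for (float[] p : corners) {
            l = Math.min(l, p[0]);
            r = Math.max(r, p[0]);
            b = Math.min(b, p[1]);
            t = Math.max(t, p[1]);
        }
        return new TextBox(l, b, r, t);
    }
}
```

Inside renderText you would build one such box per TextRenderInfo and store it next to the result of getText().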
PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which allows you to simply write the following given a PdfReader reader, an int page, and a RenderListener renderListener:
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);
If you're trying to do text extraction, you should be aware that this is decidedly a non-trivial process. You will, at a minimum, have to implement an RPN machine to run the code, accumulate transformations, and execute all the text operators. You will need to interpret the font metrics from the current set of page resources, and you will likely need to understand the text encoding.
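To give an idea of what "an RPN machine" means here, the following is a deliberately tiny sketch: operands are pushed onto a stack until an operator arrives, which then consumes them. It assumes the content stream has already been split into tokens, handles only BT, ET, Tf, Td and Tj, tracks a plain translation instead of full matrices, and ignores glyph advance, fonts and encodings entirely. The class name and output format are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TinyContentInterpreter {
    // Runs pre-tokenized content stream tokens and returns one
    // "text@x,y" entry per Tj operator.
    static List<String> run(String... tokens) {
        Deque<String> operands = new ArrayDeque<>();
        List<String> shown = new ArrayList<>();
        float lineX = 0;
        float lineY = 0;
        for (String token : tokens) {
            switch (token) {
                case "BT": // begin text object: reset the text position
                    lineX = 0;
                    lineY = 0;
                    break;
                case "Td": { // operands: tx ty (ty is on top of the stack)
                    float ty = Float.parseFloat(operands.pop());
                    float tx = Float.parseFloat(operands.pop());
                    lineX += tx;
                    lineY += ty;
                    break;
                }
                case "Tj": { // operand: a literal string such as (Hello)
                    String s = operands.pop();
                    String text = s.substring(1, s.length() - 1);
                    shown.add(text + "@" + lineX + "," + lineY);
                    break;
                }
                case "Tf": // font selection is ignored in this sketch
                case "ET":
                    operands.clear();
                    break;
                default: // anything unrecognized is treated as an operand
                    operands.push(token);
            }
        }
        return shown;
    }
}
```

A real interpreter has to know every operator, maintain the graphics state stack and the text and transformation matrices, and advance the position by the width of each glyph, which is where the font metrics come in.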
When I worked on Acrobat 1.0, I was responsible for the "Find..." command which included your problem as a subset. With a richer set of tools and more expertise, it took a couple months to get it right.