I use iText to convert PDFs to text files. It works well overall, but for some words it does the following: for example, the PDF contains a phrase like "present the mai
You can use JasperReports. It works like a charm.
To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text. Every letter came out separated by a space.
For example, "text" would read as "t e x t".
I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I applied to the "single space width" value, the text came out the same way as before. So I debugged my code to look at the width value itself, and it turned out to be 0. Any factor times 0 is still 0, so every gap wider than 0 was treated as a space.
To circumvent that, you can use a fixed value in the code outlined by mkl:
float spacing = lastEnd.subtract(start).length();
if (spacing > someFixValue)
{
result.append(' ');
}
If you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is IsChunkAtWordBoundary(...).
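The fixed-value fallback described above can be sketched without iText at all. The following is only an illustration under assumptions: the Chunk class, the FIXED_GAP constant, and the thresholdFor and join helpers are hypothetical names, not iText API, and the fallback value of 1.0 is an arbitrary guess that would need tuning for a real document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for positioned text chunks; these are not iText classes.
final class FixedGapSketch {
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Assumed fallback threshold in text-space units; tune it per document.
    static final float FIXED_GAP = 1.0f;

    // Use half the space width when it is usable, otherwise the fixed value.
    static float thresholdFor(float singleSpaceWidth) {
        return singleSpaceWidth > 0f ? singleSpaceWidth / 2f : FIXED_GAP;
    }

    // Concatenate chunks, inserting a space wherever the gap exceeds the threshold.
    static String join(List<Chunk> chunks, float threshold) {
        StringBuilder result = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null && chunk.startX - last.endX > threshold) {
                result.append(' ');
            }
            result.append(chunk.text);
            last = chunk;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("t", 0f, 5f), new Chunk("e", 5.2f, 10f),
                new Chunk("x", 10.1f, 15f), new Chunk("t", 15.2f, 20f),
                new Chunk("is", 23f, 30f));
        // A space width of 0 makes the threshold 0, so every tiny gap becomes a space.
        System.out.println(join(chunks, 0f / 2f));
        // The fallback threshold only splits at the genuinely wide gap.
        System.out.println(join(chunks, thresholdFor(0f)));
    }
}
```

In a real strategy, the same comparison would go wherever the gap is tested against the single space width.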
The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word.
Unfortunately, the same mechanism is also used to enhance the appearance of adjacent glyphs: in some letter combinations, for a good appearance and reading experience, the glyphs should be printed nearer to each other or farther from each other than they would be by default. This is done in PDFs using the same operation as above.
Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.
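To make the ambiguity concrete, here is a toy model of such a heuristic. It assumes the operation in question is the TJ text-showing operator, whose array mixes strings with numeric adjustments given in thousandths of a text-space unit (negative numbers shift to the right). The class name, the extract method, and the threshold values are made up for illustration; this is not how iText itself parses content streams.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a TJ-style array; not an iText class.
final class TjHeuristicSketch {
    // Elements are either String (glyphs to show) or Number (a position
    // adjustment in thousandths of a text-space unit; negative moves right).
    static String extract(List<Object> tjArray, float minWordGap) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                float shiftRight = -((Number) element).floatValue();
                // Heuristic: only a large enough shift to the right counts as a space.
                if (shiftRight > minWordGap) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // -300 is meant as a word gap, -120 as mere glyph spacing.
        List<Object> tj = Arrays.<Object>asList("Hello", -300, "W", -120, "orld");
        System.out.println(extract(tj, 200f)); // threshold between the two shifts
        System.out.println(extract(tj, 100f)); // too low: the glyph spacing becomes a space
        System.out.println(extract(tj, 400f)); // too high: the word gap is lost
    }
}
```

The same array yields three different texts depending on the threshold, which is why no single value works for every document.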
You use SimpleTextExtractionStrategy as text extraction strategy. The heuristics in this case are implemented like this (as currently in the renderText method in SimpleTextExtractionStrategy.java in the iText SVN trunk):
float spacing = lastEnd.subtract(start).length();
if (spacing > renderInfo.getSingleSpaceWidth()/2f)
{
result.append(' ');
}
Thus, a gap that is more than half as wide as the current width of a space character is translated into a space character.
This generally sounds sensible. For documents that use only horizontal shifts to separate words, though, the current width of the actual space character may not be a good measure for the heuristics.
So, what you can do is try to improve the heuristics in the text extraction strategy. Copy the existing one, manipulate it, and use it in your code.
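One possible direction, purely as an illustration and not something taken from iText: measure the gap against the average glyph width of the preceding chunk instead of the space character width. The Chunk class, the isWordBoundary method, and the factor of 0.3 below are all hypothetical and would need tuning against real documents.

```java
// Hypothetical alternative heuristic; none of these names are iText API.
final class RelativeGapSketch {
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Treat the gap as a word boundary when it exceeds a fraction (factor)
    // of the average character width of the previous chunk.
    static boolean isWordBoundary(Chunk previous, Chunk current, float factor) {
        float gap = current.startX - previous.endX;
        int glyphCount = Math.max(1, previous.text.length());
        float averageCharWidth = (previous.endX - previous.startX) / glyphCount;
        return gap > averageCharWidth * factor;
    }

    public static void main(String[] args) {
        Chunk previous = new Chunk("present", 0f, 35f); // average glyph width 5
        Chunk wideGap = new Chunk("the", 37f, 52f);     // gap of 2
        Chunk tightGap = new Chunk("ly", 35.5f, 45f);   // gap of 0.5
        System.out.println(isWordBoundary(previous, wideGap, 0.3f));
        System.out.println(isWordBoundary(previous, tightGap, 0.3f));
    }
}
```

This measure does not depend on the space glyph at all, so a space width of 0 or an unused space character no longer affects the result. Whether it works better depends on the fonts and spacing in the document.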
If you supply a sample PDF for your issue, we might have some ideas to help.