iText Java PDF to text creation

Submitted by 大憨熊 on 2019-11-27 09:50:54

The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead you often find an operation in PDFs which after rendering one word moves the current position slightly to the right before rendering the next word.

Unfortunately the same mechanism also is used to enhance the appearance of adjacent glyphs: In some letter combinations, for a good appearance and reading experience the glyphs should be printed nearer to each other or farther from each other than they would be by default. This is done in PDFs using the same operation as above.

Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.

You use SimpleTextExtractionStrategy as text extraction strategy. The heuristics in this case are implemented like this (as currently in the renderText method in SimpleTextExtractionStrategy.java in the iText SVN trunk):

float spacing = lastEnd.subtract(start).length();
if (spacing > renderInfo.getSingleSpaceWidth()/2f)
{
    result.append(' ');
}

Thus, a gap which is more than half as wide as the current width of a space character is translated into a space character.

This generally sounds sensible. For documents which only use horizontal shifts to separate words, though, the current width of the actual space character may not be a good measure for the heuristics.
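To make the heuristic and its weak spot concrete, here is a minimal, self-contained sketch that does not use iText at all. The class GapHeuristicDemo, the Chunk type and the coordinates are all made up for illustration; only the comparison against half the space width mirrors the snippet above.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the gap heuristic. None of these names are iText API. */
final class GapHeuristicDemo {

    /** A run of text plus the x coordinates where its baseline starts and ends. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /** Joins chunks, adding a space when the gap exceeds half the space width. */
    static String join(List<Chunk> chunks, float singleSpaceWidth) {
        StringBuilder result = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null) {
                // Distance between the end of the previous run and the start of this one.
                float spacing = Math.abs(chunk.startX - last.endX);
                if (spacing > singleSpaceWidth / 2f) {
                    result.append(' ');
                }
            }
            result.append(chunk.text);
            last = chunk;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("He", 0f, 10f),
                new Chunk("llo", 10.3f, 25f),   // small shift: glyph spacing only
                new Chunk("world", 28f, 50f));  // larger shift: meant as a word gap

        // Space width 4 gives a threshold of 2, so only the gap of 3 becomes a space.
        System.out.println(join(chunks, 4f)); // prints "Hello world"

        // Space width 8 gives a threshold of 4, so the word gap of 3 is swallowed.
        System.out.println(join(chunks, 8f)); // prints "Helloworld"
    }
}
```

The second call shows the failure mode: when the font's space glyph is wide compared to the shifts the document actually uses between words, the words run together.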

So, what you can do is try to improve the heuristics in the text extraction strategy. Copy the existing one, manipulate it, and use it in your code.

If you supply a sample PDF for your issue, we might have some ideas to help.

You can use JasperReports. It works like a charm.

To expand on the brilliant explanation by mkl, here is a detail for a specific variation of the issue presented in the question. I stumbled upon a document from which I wanted to extract text. Every letter came out separated by a space.

For example, "text" would read as "t e x t".

I tried implementing my own extraction strategy class as outlined by mkl. Whichever factor I tried to apply to the "single space width" value, the text came out the same way as before. So I debugged my code to see the width value itself and it turned out to be 0.

To circumvent that, you can use a fixed value in the code outlined by mkl:

float spacing = lastEnd.subtract(start).length();
if (spacing > someFixValue)
{
    result.append(' ');
}

If you base your own extraction strategy on LocationTextExtractionStrategy, the method you want to override is IsChunkAtWordBoundary(...) (in the Java version of iText it is spelled isChunkAtWordBoundary(...)).
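The effect of a single space width of 0 can be reproduced without iText. The following is only a sketch under made-up assumptions: the class FixedGapDemo, the glyph coordinates and the threshold of 1.0 are hypothetical, and a suitable fixed value has to be found per document.

```java
/** Toy model of the zero-width problem. None of these names are iText API. */
final class FixedGapDemo {

    /** Hypothetical fixed threshold, in the same units as the coordinates. */
    static final float SOME_FIX_VALUE = 1.0f;

    /** Joins glyphs, adding a space wherever the gap exceeds the threshold. */
    static String join(String[] glyphs, float[] startX, float[] endX, float threshold) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                float spacing = Math.abs(startX[i] - endX[i - 1]);
                if (spacing > threshold) {
                    result.append(' ');
                }
            }
            result.append(glyphs[i]);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Each letter is drawn separately; tiny gaps inside a word, a gap of 3 between words.
        String[] glyphs = {"t", "e", "x", "t", "o", "k"};
        float[] startX = {0f, 3.1f, 8.1f, 13.1f, 19f, 24.1f};
        float[] endX = {3f, 8f, 13f, 16f, 24f, 29f};

        // A reported space width of 0 gives a threshold of 0: every tiny gap becomes a space.
        float reportedSpaceWidth = 0f;
        System.out.println(join(glyphs, startX, endX, reportedSpaceWidth / 2f)); // prints "t e x t o k"

        // A fixed threshold ignores the tiny gaps and still catches the word gap.
        System.out.println(join(glyphs, startX, endX, SOME_FIX_VALUE)); // prints "text ok"
    }
}
```

Because the threshold derived from the width is 0, no factor applied to it can change the result, which matches what I saw when debugging.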
