Which is the right text extraction strategy?

故里飘歌 2021-01-26 07:57

In my C# code I am extracting text from a PDF, and I have two methods of doing it. However, one method works for one type of PDF document and the other method works for the other typ

1 answer

不思量自难忘° (OP)

2021-01-26 08:46

Concerning Method 1: The PdfParser from that CodeProject article only works in special situations.

It assumes that all the text content is contained in the immediate page content stream(s). In fact, these streams may reference resources (for example form XObjects) which themselves contain text. This is especially common in n-up documents, but it can happen in any document.
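To make the difference concrete, here is a minimal sketch in Python (the logic is library-independent). It uses a toy model, not a real PDF parser: a content stream is a list of `(operator, operand)` pairs, `Tj` shows text and `Do` paints a named resource. The names `naive_text`, `recursive_text`, `Fm1` and `Fm2` are made up for illustration, and a real extractor would also have to resolve each resource's own resource dictionary and guard against cyclic references.

```python
from typing import Dict, List, Tuple

# Toy model: ("Tj", "text") shows text, ("Do", "name") paints a named resource.
Operation = Tuple[str, str]


def naive_text(stream: List[Operation]) -> str:
    """Collect text from the stream itself only, ignoring painted resources."""
    return "".join(arg for op, arg in stream if op == "Tj")


def recursive_text(stream: List[Operation],
                   xobjects: Dict[str, List[Operation]]) -> str:
    """Collect text from the stream and from every resource it paints."""
    parts = []
    for op, arg in stream:
        if op == "Tj":
            parts.append(arg)
        elif op == "Do" and arg in xobjects:
            # Descend into the referenced resource (assumes no cycles).
            parts.append(recursive_text(xobjects[arg], xobjects))
    return "".join(parts)


# Hypothetical 2-up page: the visible text lives mostly in the resources.
xobjects = {"Fm1": [("Tj", " page one")], "Fm2": [("Tj", " page two")]}
page = [("Tj", "n-up:"), ("Do", "Fm1"), ("Do", "Fm2")]

print(naive_text(page))                # n-up:
print(recursive_text(page, xobjects))  # n-up: page one page two
```

A parser that behaves like `naive_text` silently drops everything painted through a resource, which matches the symptom of one method working only for some documents.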

Furthermore, it assumes some Latin1-like character encoding. This often holds for text in European languages (often, not always!), but for many Asian languages it hardly ever produces anything sensible.
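A small Python sketch of why that assumption breaks. It assumes a hypothetical font that uses two-byte character codes and comes with its own code-to-Unicode table (in a real PDF this information would come from the font's encoding or ToUnicode data). The table `to_unicode` and both function names are invented for the example.

```python
from typing import Dict

# Hypothetical string operand: two 2-byte character codes, 0x0001 and 0x0002.
raw = bytes.fromhex("00010002")

# Hypothetical font-specific table mapping character codes to Unicode text.
to_unicode: Dict[int, str] = {0x0001: "漢", 0x0002: "字"}


def decode_latin1(data: bytes) -> str:
    """What a Latin1-assuming parser does: one byte, one character."""
    return data.decode("latin-1")


def decode_with_map(data: bytes, mapping: Dict[int, str]) -> str:
    """Read 2-byte codes and look each one up in the font's table."""
    chars = []
    for i in range(0, len(data) - 1, 2):
        code = int.from_bytes(data[i:i + 2], "big")
        # Unknown codes become U+FFFD instead of being guessed.
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


print(repr(decode_latin1(raw)))          # '\x00\x01\x00\x02'
print(decode_with_map(raw, to_unicode))  # 漢字
```

The Latin1 reading yields only control characters here, while the font-aware lookup recovers the text.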

Additionally it interprets all kerning gaps as space characters.

Concerning Method 2: As mentioned in a comment on your earlier question, "How to extract text from a PDF and decode characters?", you might want to have a look at this answer to a similar problem.

Essentially, the reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead, you often find an operation in PDFs which, after rendering one word, moves the current position slightly to the right before rendering the next word.

Unfortunately, the same mechanism is also used to improve the appearance of adjacent glyphs: in some letter combinations, the glyphs should be printed nearer to or farther from each other than they would be by default, for a good appearance and reading experience. This is done in PDFs using the same operation as above.

Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.
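As an illustration, here is a minimal sketch of such a heuristic in Python. It works on an array in the style of the PDF `TJ` operator, where strings are shown and numbers adjust the position: a number is subtracted from the current position in thousandths of a text space unit, so a negative number moves to the right in horizontal writing. The function name `text_from_tj` and the cut-off of 200 are assumptions chosen for the example, not values taken from any library or specification.

```python
from typing import Sequence, Union

TJElement = Union[str, float]


def text_from_tj(elements: Sequence[TJElement],
                 space_threshold: float = 200.0) -> str:
    """Join the strings of a TJ-style array, guessing where spaces belong.

    space_threshold is an assumed cut-off in thousandths of a text space
    unit; shifts to the right at least this large count as a word gap.
    """
    parts = []
    for element in elements:
        if isinstance(element, str):
            parts.append(element)
        elif -element >= space_threshold:
            # Large shift to the right: treat it as a space, but only once.
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
        # Smaller shifts, and shifts to the left, are taken to be kerning.
    return "".join(parts)


# Hypothetical array: kerning inside "Average", a word gap before "word".
sample = ["A", 80, "v", -15, "erage", -300, "w", 20, "ord"]

print(text_from_tj(sample))                      # Average word
print(text_from_tj(sample, space_threshold=10))  # Av erage word
```

The second call shows how the heuristic fails when the cut-off is too low: a kerning adjustment is mistaken for a word gap. Setting it too high has the opposite effect and words run together, which is why the threshold usually has to be tuned to the documents at hand.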

The answer referenced indicates how to tweak these heuristics, and the original poster of the question accordingly found a good solution for parsing his PDFs successfully.

If you want a definitive solution to your problem, you should supply sample PDFs in which you observed the issue.
