Empty whitespace conversion in PDFClown

Posted by 送分小仙女 on 2019-12-12 01:15:55

Question


I'm having an issue with the TextExtractor class in PDFClown when a document contains empty whitespace characters, also known as "discretionary newlines". These characters are embedded at seemingly random positions, and Acrobat Reader ignores them. As a result, a line containing them displays as a single line in Acrobat but is broken into several lines in the extracted text if I specify '\n' as the newline character in TextExtractor.ToString(...).

It appears that PDFClown simply converts any whitespace character into a single space (' '). Is there a way to bypass this conversion so that the original character is extracted instead?


Answer 1:


After more research, it appears that the PDFClown library is very buggy. I ran into several issues:

  • It converts most kinds of space character to a single normal space character.
  • It inserts spaces instead of newlines.
  • If you use the provided overrides to insert your own character for spaces or newlines, the internal mapping from characters in the extracted array to the bounding box of each individual character gets destroyed.
  • It cannot properly decode all embedded fonts.
  • Because it cannot properly decode embedded fonts, it silently omits characters from the extracted text.
  • It cannot reliably handle ligatures or their decomposition. Ligatures are often silently dropped from the extracted text altogether.

To come directly to the issue I had: you can detect and remove these "false" whitespace characters by checking whether their bounding rectangle overlaps that of other, non-whitespace characters. Given all the other issues with the library, though, my advice is to use PDFBox instead.
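For anyone who wants to try the bounding-rectangle check first, here is a minimal, library-agnostic sketch of the idea in Python. It does not use the PDFClown API: the `Glyph` class, its field names, and the `tolerance` value are assumptions made up for illustration, so you would have to fill the list from whatever per-character boxes your extractor gives you.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Glyph:
    """Hypothetical stand-in for one extracted character and its box."""
    char: str
    x: float       # left edge
    y: float       # bottom edge
    width: float
    height: float


def overlaps(a: Glyph, b: Glyph, tolerance: float = 0.0) -> bool:
    """Return True if the boxes overlap by more than `tolerance` on both axes."""
    overlap_x = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    overlap_y = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return overlap_x > tolerance and overlap_y > tolerance


def drop_false_whitespace(glyphs: List[Glyph], tolerance: float = 0.5) -> List[Glyph]:
    """Drop whitespace glyphs whose box overlaps any non-whitespace glyph."""
    visible = [g for g in glyphs if not g.char.isspace()]
    kept = []
    for g in glyphs:
        if g.char.isspace() and any(overlaps(g, v, tolerance) for v in visible):
            continue  # "false" whitespace sitting on top of real text
        kept.append(g)
    return kept


# Assumed sample data: a stray newline on top of "a"/"b", and a real space.
sample = [
    Glyph("a", 0, 0, 5, 10),
    Glyph("\n", 3, 0, 4, 10),
    Glyph("b", 5, 0, 5, 10),
    Glyph(" ", 10, 0, 3, 10),
    Glyph("c", 13, 0, 5, 10),
]
cleaned = "".join(g.char for g in drop_false_whitespace(sample))
print(repr(cleaned))
```

The tolerance is there so that a genuine space which merely touches its neighbours is kept. The comparison is quadratic in the number of characters, which is probably fine per line or per page but would need an index for very large inputs.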

If you're using .NET and you'd like to use PDFBox, you can use Tika On Dot Net, which is the Apache Tika project brought over to .NET via IKVM.

Apache Tika is a collection of other libraries, including PDFBox. Tika On Dot Net currently ships PDFBox 1.8.10 and also has a NuGet package, which makes it easy to add to your project.

I had a project go 1.5 weeks over deadline because all of these issues were discovered halfway through, which required a full rewrite. Just a heads up.



Source: https://stackoverflow.com/questions/36896762/empty-whitespace-conversion-in-pdfclown
