How to fix UTF encoding for whitespaces?

感动是毒 2020-12-16 14:10

In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding

3 answers
  • 2020-12-16 14:26

    Decoding the bytes \xC2\xA0 (194, 160) as UTF-8 yields U+00A0, the Unicode no-break space. This is a different character from the ordinary space (U+0020), so it does not match ordinary spaces. You have to match against the no-break space explicitly, or match against any whitespace character.
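
    The decoding step described above can be checked directly. The following is a minimal sketch, not the asker's code: the class name NbspDecoding and the helper DecodeUtf8 are hypothetical, and the byte array simply hard-codes the two bytes quoted in the question.

    ```csharp
    using System;
    using System.Text;

    static class NbspDecoding
    {
        // Hypothetical helper: decodes a byte array as UTF-8.
        public static string DecodeUtf8(byte[] bytes)
        {
            return Encoding.UTF8.GetString(bytes);
        }

        static void Main()
        {
            // The two bytes from the question: 0xC2 0xA0 (194, 160).
            string decoded = DecodeUtf8(new byte[] { 0xC2, 0xA0 });

            Console.WriteLine(decoded.Length);                    // 1: a single character
            Console.WriteLine(((int)decoded[0]).ToString("X4"));  // 00A0
            Console.WriteLine(decoded[0] == ' ');                 // False: not U+0020
            Console.WriteLine(char.IsWhiteSpace(decoded[0]));     // True: still whitespace
        }
    }
    ```

    The last two lines show why a plain comparison with ' ' fails while a whitespace-aware check succeeds.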

  • 2020-12-16 14:32

    194 160 is the UTF-8 encoding of a NO-BREAK SPACE codepoint (the same codepoint that HTML calls &nbsp;).

    So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.

    To simply replace NO-BREAK spaces you can do the following:

    src = src.Replace('\u00A0', ' ');
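
    If the extracted text may contain other kinds of whitespace as well, the regular expression route mentioned above is an alternative. This is only a sketch under assumptions: the class NbspMatching, the helper NormalizeWhitespace, and the sample string are made up for illustration. It relies on the default .NET behavior where \s matches Unicode whitespace; with RegexOptions.ECMAScript, \s is limited to ASCII whitespace and would not match U+00A0.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    static class NbspMatching
    {
        // Hypothetical helper: collapses every run of whitespace
        // (including U+00A0) into one ordinary space.
        public static string NormalizeWhitespace(string src)
        {
            return Regex.Replace(src, @"\s+", " ");
        }

        static void Main()
        {
            // Assumed sample: two words separated by a no-break space.
            string extracted = "hello\u00A0world";

            // Plain ordinal search for an ordinary space does not find it.
            Console.WriteLine(extracted.Contains("hello world"));          // False

            // \s matches the no-break space.
            Console.WriteLine(Regex.IsMatch(extracted, @"hello\sworld"));  // True

            // After normalizing, the plain comparison works.
            Console.WriteLine(NormalizeWhitespace(extracted) == "hello world"); // True
        }
    }
    ```

    Note that this also turns tabs and line breaks into spaces, so the single-character Replace is the safer choice when only no-break spaces should change.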
    
  • 2020-12-16 14:34

    In UTF-8, the byte sequence C2 A0 (194 160) encodes NO-BREAK SPACE (U+00A0). According to ISO/IEC 8859, this is a space at which no line break may be inserted. Text-processing software normally assumes that a line break can be inserted at any whitespace character (this is how word wrap is usually implemented). Replacing these characters in your string with a normal space should fix the problem.
