Why is the length of this string longer than the number of characters in it?

隐瞒了意图╮ 2020-11-29 19:26

This code:

string a = "abc";
string b = "A


        
8 Answers
  • 2020-11-29 20:21

    Okay, in .NET and C# all strings are encoded as UTF-16LE. A string is stored as a sequence of chars, and each char holds 2 bytes (16 bits).
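
    As a quick check of the storage claim, here is a minimal sketch. The variable names are made up for illustration, and Encoding.Unicode is used because it is .NET's UTF-16 little-endian encoding.

```csharp
using System;
using System.Text;

string abc = "abc";

// Length counts 16-bit char values (UTF-16 code units).
int charCount = abc.Length;

// Encoding.Unicode is UTF-16 little-endian, so each char takes two bytes.
int byteCount = Encoding.Unicode.GetByteCount(abc);

// 'A' is U+0041; in little-endian order the low byte comes first.
byte[] bytesOfA = Encoding.Unicode.GetBytes("A");

Console.WriteLine($"{charCount} chars, {byteCount} bytes"); // 3 chars, 6 bytes
Console.WriteLine(BitConverter.ToString(bytesOfA));         // 41-00
```

    For plain ASCII text like this, the number of chars and the number of visible characters happen to agree, which is why the difference usually goes unnoticed.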

    What we see "on paper or screen" as a single letter, character, glyph, symbol, or punctuation mark can be thought of as a single Text Element. As described in Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION, each Text Element is represented by one or more Code Points. An exhaustive list of Codes can be found here.

    Each Code Point needs to be encoded into binary for internal representation by a computer. As stated, each char stores 2 bytes. Code Points at or below U+FFFF can be stored in a single char. Code Points above U+FFFF are stored as a surrogate pair, using two chars to represent a single Code Point.
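
    The following sketch shows the two cases side by side. U+0041 and U+1F600 are arbitrary example Code Points chosen for illustration, not the ones from the question.

```csharp
using System;

// U+0041 (at or below U+FFFF) fits in one char.
string single = char.ConvertFromUtf32(0x0041);

// U+1F600 (above U+FFFF) needs a surrogate pair of two chars.
string paired = char.ConvertFromUtf32(0x1F600);

bool isPair = char.IsSurrogatePair(paired, 0);
string highHex = ((int)paired[0]).ToString("X4"); // high surrogate
string lowHex = ((int)paired[1]).ToString("X4");  // low surrogate

// Decoding the pair gives back the original Code Point.
int roundTrip = char.ConvertToUtf32(paired, 0);

Console.WriteLine(single.Length);          // 1
Console.WriteLine(paired.Length);          // 2
Console.WriteLine($"{highHex} {lowHex}");  // D83D DE00
Console.WriteLine(isPair);                 // True
```

    So a string holding one Code Point above U+FFFF reports a Length of 2, even though it is a single Code Point.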

    Given what we now know, we can deduce that a Text Element can be stored as one char, as a surrogate pair of two chars, or, if the Text Element is represented by multiple Code Points, as some combination of single chars and surrogate pairs. As if that weren't complicated enough, some Text Elements can be represented by different combinations of Code Points, as described in Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS.


    Interlude

    So, strings that look the same when rendered can actually be made up of different combinations of chars. An ordinal (byte-by-byte) comparison of two such strings would detect a difference, which may be unexpected or undesirable.

    You can re-encode .NET strings so that they use the same Normalization Form. Once normalized, two strings with the same Text Elements will be encoded the same way. To do this, use the string.Normalize method. However, remember that some different Text Elements look similar to each other. :-s
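
    A small sketch of that idea, using "é" as an assumed example: it can be written as the single Code Point U+00E9 or as "e" followed by the combining acute accent U+0301. Form C is picked here only for illustration; any form works as long as both strings use the same one.

```csharp
using System;
using System.Text;

string composed = "\u00E9";     // é as one Code Point
string decomposed = "e\u0301";  // e + combining acute accent

// Ordinal comparison sees different chars, so the strings differ.
bool equalBefore = string.Equals(composed, decomposed, StringComparison.Ordinal);

// Bring both strings into the same Normalization Form (Form C here).
string normalizedA = composed.Normalize(NormalizationForm.FormC);
string normalizedB = decomposed.Normalize(NormalizationForm.FormC);
bool equalAfter = string.Equals(normalizedA, normalizedB, StringComparison.Ordinal);

Console.WriteLine($"{composed.Length} vs {decomposed.Length}"); // 1 vs 2
Console.WriteLine(equalBefore); // False
Console.WriteLine(equalAfter);  // True
```

    Note that the two inputs also report different Length values before normalization, even though they render the same.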


    So, what does this all mean in relation to the question? The Text Element '

  • 2020-11-29 20:22

    That is because the Length property returns the number of char objects, not the number of Unicode characters. In your case, one of the Unicode characters is represented by more than one char object (a surrogate pair).

    The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
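
    For example, the three counts can be compared like this. The sample string is made up for illustration (it is not the string from the question), and the Code Point loop is just one simple way to count them.

```csharp
using System;
using System.Globalization;

// "A", one Code Point above U+FFFF, and "e" plus a combining accent.
string sample = "A" + char.ConvertFromUtf32(0x1F600) + "e\u0301";

// Number of char objects (UTF-16 code units).
int charCount = sample.Length;

// Number of Code Points: treat each surrogate pair as one.
int codePointCount = 0;
for (int i = 0; i < sample.Length; i++)
{
    codePointCount++;
    if (char.IsSurrogatePair(sample, i))
    {
        i++; // skip the low surrogate of the pair
    }
}

// Number of text elements, as seen by StringInfo.
int textElementCount = new StringInfo(sample).LengthInTextElements;

Console.WriteLine(charCount);        // 5
Console.WriteLine(codePointCount);   // 4
Console.WriteLine(textElementCount); // 3
```

    StringInfo.GetTextElementEnumerator can be used in the same way when you need the individual text elements rather than just their count.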
