I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wacky (surrogate) pairs instead. Thus when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?
This answer is not correct. See @Virtlink's answer for the correct one.
using System.Collections.Generic;
using System.Globalization;

static int[] ExtractScalars(string s)
{
    // Collapse composite characters into single code points where possible.
    if (!s.IsNormalized())
    {
        s = s.Normalize();
    }

    List<int> chars = new List<int>((s.Length * 3) / 2);

    var ee = StringInfo.GetTextElementEnumerator(s);
    while (ee.MoveNext())
    {
        string e = ee.GetTextElement();
        chars.Add(char.ConvertToUtf32(e, 0));
    }

    return chars.ToArray();
}
Notes: Normalization is required to deal with composite characters.
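For example, a minimal sketch of what normalization buys here (Normalize() defaults to Form C):

string decomposed = "e\u0301";             // U+0065 + U+0301 (combining acute)
string composed = decomposed.Normalize();  // "é" -- the single code point U+00E9
Console.WriteLine(decomposed.Length);      // 2
Console.WriteLine(composed.Length);        // 1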
You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

- The character is from the Basic Multilingual Plane (BMP) and is encoded by a single code unit.
- The character is outside the BMP and is encoded as a surrogate high-low pair of code units.
Therefore, assuming the string is valid, this returns an array of code points for a given string:
using System;
using System.Collections.Generic;

public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException(nameof(str));

    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        codePoints.Add(char.ConvertToUtf32(str, i));
        if (char.IsHighSurrogate(str[i]))
            i += 1;  // skip the low surrogate of the pair
    }

    return codePoints.ToArray();
}
An example with a surrogate pair 🌀 and a composed character ñ:
ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀 E l N i n ◌̃ o
Here's another example. These two code points represent a thirty-second note with a staccato accent, both encoded as surrogate pairs:
ToCodePoints("\U0001D162\U0001D181"); // 𝅘𝅥𝅰𝆁
// { 0x1d162, 0x1d181 } // 𝅘𝅥𝅰 ◌𝆁
When normalized to Form C (the default for Normalize()), they are decomposed into a notehead, a combining stem, a combining flag and a combining accent-staccato, all surrogate pairs (these musical symbols are composition exclusions, so even Form C leaves them decomposed):
ToCodePoints("\U0001D162\U0001D181".Normalize()); // 𝅘𝅥𝅰𝆁
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 } // 𝅘 𝅥 𝅰 ◌𝆁
Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ◌̃. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
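A minimal sketch of the difference, assuming both methods above are in scope (x̃ has no precomposed form, so normalization cannot merge it):

ExtractScalars("x\u0303");  // { 0x78 }         -- the combining tilde U+0303 is silently dropped
ToCodePoints("x\u0303");    // { 0x78, 0x303 }  -- both code points preserved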
Doesn't seem like it should be much more complicated than this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static IEnumerable<int> Utf32CodePoints(this IEnumerable<char> s)
{
    // Encode in the machine's byte order so BitConverter.ToInt32 reads the values back correctly.
    bool useBigEndian = !BitConverter.IsLittleEndian;
    Encoding utf32 = new UTF32Encoding(useBigEndian, false, true); // no BOM, throw on invalid input

    // Encoding.GetBytes has no IEnumerable<char> overload, so materialize the chars first.
    byte[] octets = utf32.GetBytes(s.ToArray());

    for (int i = 0; i < octets.Length; i += 4)
    {
        yield return BitConverter.ToInt32(octets, i);
    }
}
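Usage, assuming the extension method is declared in a static class:

int[] codePoints = "a\U0001F300b".Utf32CodePoints().ToArray();
// { 0x61, 0x1F300, 0x62 }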
I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:
public static IEnumerable<int> GetCodePoints(this string s) {
var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
var bytes = utf32.GetBytes(s);
return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
}
The enumeration was all I needed, but getting an array is trivial:
int[] codePoints = myString.GetCodePoints().ToArray();
Source: https://stackoverflow.com/questions/687359/how-would-you-get-an-array-of-unicode-code-points-from-a-net-string