I\'m using this code to generate U+10FFFC
var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC});
I know it\'s for
Yet another alternative to enumerate the UTF32 characters in a C# string is to use the System.Globalization.StringInfo.GetTextElementEnumerator
method, as in the code below.
public static class StringExtensions
{
public static System.Collections.Generic.IEnumerable<UTF32Char> GetUTF32Chars(this string s)
{
var tee = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (tee.MoveNext())
{
yield return new UTF32Char(s, tee.ElementIndex);
}
}
}
public struct UTF32Char
{
private string s;
private int index;
public UTF32Char(string s, int index)
{
this.s = s;
this.index = index;
}
public override string ToString()
{
return char.ConvertFromUtf32(this.UTF32Code);
}
public int UTF32Code { get { return char.ConvertToUtf32(s, index); } }
public double NumericValue { get { return char.GetNumericValue(s, index); } }
public UnicodeCategory UnicodeCategory { get { return char.GetUnicodeCategory(s, index); } }
public bool IsControl { get { return char.IsControl(s, index); } }
public bool IsDigit { get { return char.IsDigit(s, index); } }
public bool IsLetter { get { return char.IsLetter(s, index); } }
public bool IsLetterOrDigit { get { return char.IsLetterOrDigit(s, index); } }
public bool IsLower { get { return char.IsLower(s, index); } }
public bool IsNumber { get { return char.IsNumber(s, index); } }
public bool IsPunctuation { get { return char.IsPunctuation(s, index); } }
public bool IsSeparator { get { return char.IsSeparator(s, index); } }
public bool IsSurrogatePair { get { return char.IsSurrogatePair(s, index); } }
public bool IsSymbol { get { return char.IsSymbol(s, index); } }
public bool IsUpper { get { return char.IsUpper(s, index); } }
public bool IsWhiteSpace { get { return char.IsWhiteSpace(s, index); } }
}
While @R. Martinho Fernandes's answer is correct, his AsCodePoints
extension method has two issues:
ArgumentException
on invalid code points (high surrogate without low surrogate or vice versa).char
static methods that take (char)
or (string, int)
(such as char.IsNumber()
) if you only have int code points.I've split the code into two methods, one similar to the original but returns the Unicode Replacement Character on invalid code points. The second method returns a struct IEnumerable with more useful fields:
StringCodePointExtensions.cs
public static class StringCodePointExtensions {
const char ReplacementCharacter = '\ufffd';
public static IEnumerable<CodePointIndex> CodePointIndexes(this string s) {
for (int i = 0; i < s.Length; i++) {
if (char.IsHighSurrogate(s, i)) {
if (i + 1 < s.Length && char.IsLowSurrogate(s, i + 1)) {
yield return CodePointIndex.Create(i, true, true);
i++;
continue;
} else {
// High surrogate without low surrogate
yield return CodePointIndex.Create(i, false, false);
continue;
}
} else if (char.IsLowSurrogate(s, i)) {
// Low surrogate without high surrogate
yield return CodePointIndex.Create(i, false, false);
continue;
}
yield return CodePointIndex.Create(i, true, false);
}
}
public static IEnumerable<int> CodePointInts(this string s) {
return s
.CodePointIndexes()
.Select(
cpi => {
if (cpi.Valid) {
return char.ConvertToUtf32(s, cpi.Index);
} else {
return (int)ReplacementCharacter;
}
});
}
}
CodePointIndex.cs
:
public struct CodePointIndex {
public int Index;
public bool Valid;
public bool IsSurrogatePair;
public static CodePointIndex Create(int index, bool valid, bool isSurrogatePair) {
return new CodePointIndex {
Index = index,
Valid = valid,
IsSurrogatePair = isSurrogatePair,
};
}
}
To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.
As posted already by Martinho, it is much easier to create the string with this private codepoint that way:
var s = char.ConvertFromUtf32(0x10FFFC);
But to loop through the two char elements of that string is senseless:
foreach(var ch in s)
{
Console.WriteLine(ch);
}
What for? You will just get the high and low surrogate that encode the codepoint. Remember a char is a 16 bit type so it can hold just a max value of 0xFFFF. Your codepoint doesn't fit into a 16 bit type, indeed for the highest codepoint you'll need 21 bits (0x10FFFF) so the next wider type would just be a 32 bit type. The two char elements are not characters, but a surrogate pair. The value of 0x10FFFC is encoded into the two surrogates.
U+10FFFC is one Unicode code point, but string
's interface does not expose a sequence of Unicode code points directly. Its interface exposes a sequence of UTF-16 code units. That is a very low-level view of text. It is quite unfortunate that such a low-level view of text was grafted onto the most obvious and intuitive interface available... I'll try not to rant much about how I don't like this design, and just say that not matter how unfortunate, it is just a (sad) fact you have to live with.
First off, I will suggest using char.ConvertFromUtf32 to get your initial string. Much simpler, much more readable:
var s = char.ConvertFromUtf32(0x10FFFC);
So, this string's Length
is not 1, because, as I said, the interface deals in UTF-16 code units, not Unicode code points. U+10FFFC uses two UTF-16 code units, so s.Length
is 2. All code points above U+FFFF require two UTF-16 code units for their representation.
You should note that ConvertFromUtf32
doesn't return a char
: char
is a UTF-16 code unit, not a Unicode code point. To be able to return all Unicode code points, that method cannot return a single char
. Sometimes it needs to return two, and that's why it makes it a string. Sometimes you will find some APIs dealing in int
s instead of char
because int
can be used to handle all code points too (that's what ConvertFromUtf32
takes as argument, and what ConvertToUtf32
produces as result).
string
implements IEnumerable<char>
, which means that when you iterate over a string
you get one UTF-16 code unit per iteration. That's why iterating your string and printing it out yields some broken output with two "things" in it. Those are the two UTF-16 code units that make up the representation of U+10FFFC. They are called "surrogates". The first one is a high/lead surrogate and the second one is a low/trail surrogate. When you print them individually they do not produce meaningful output because lone surrogates are not even valid in UTF-16, and they are not considered Unicode characters either.
When you append those two surrogates to the string in the loop, you effectively reconstruct the surrogate pair, and printing that pair later as one gets you the right output.
And in the ranting front, note how nothing complains that you used a malformed UTF-16 sequence in that loop. It creates a string with a lone surrogate, and yet everything carries on as if nothing happened: the string
type is not even the type of well-formed UTF-16 code unit sequences, but the type of any UTF-16 code unit sequence.
The char structure provides static methods to deal with surrogates: IsHighSurrogate
, IsLowSurrogate
, IsSurrogatePair
, ConvertToUtf32
, and ConvertFromUtf32
. If you want you can write an iterator that iterates over Unicode characters instead of UTF-16 code units:
static IEnumerable<int> AsCodePoints(this string s)
{
for(int i = 0; i < s.Length; ++i)
{
yield return char.ConvertToUtf32(s, i);
if(char.IsHighSurrogate(s, i))
i++;
}
}
Then you can iterate like:
foreach(int codePoint in s.AsCodePoints())
{
// do stuff. codePoint will be an int will value 0x10FFFC in your example
}
If you prefer to get each code point as a string instead change the return type to IEnumerable<string>
and the yield line to:
yield return char.ConvertFromUtf32(char.ConvertToUtf32(s, i));
With that version, the following works as-is:
foreach(string codePoint in s.AsCodePoints())
{
Console.WriteLine(codePoint);
}