I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that
The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme (see this page for more information).
In the Unicode character encoding, characters are mapped to values between 0x000000 and 0x10FFFF. Internally, .NET stores strings using the UTF-16 encoding scheme, in which text is a sequence of two-byte (16-bit) code units. Since two bytes can only represent values from 0x0000 to 0xFFFF, some additional complexity is needed to store values above this range (0x010000 to 0x10FFFF).
This is done using pairs of 16-bit code units known as surrogates. The surrogates fall into two distinct ranges, high surrogates (0xD800 to 0xDBFF) and low surrogates (0xDC00 to 0xDFFF), depending on whether they are allowed at the start or the end of the two-unit sequence: the high surrogate always comes first and the low surrogate second.
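The split into a high and a low surrogate is plain arithmetic, and a minimal sketch of it may make the ranges easier to follow. The class and method names below (SurrogateMath, ToSurrogatePair) are made up for illustration; in real code char.ConvertFromUtf32 already does this for you.

```csharp
using System;

static class SurrogateMath
{
    // Splits a code point in the range 0x10000..0x10FFFF into its
    // UTF-16 high (first) and low (second) surrogate code units.
    // Hypothetical helper, shown only to illustrate the arithmetic.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // fits in 20 bits
        char high = (char)(0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return (high, low);
    }
}
```

For example, SurrogateMath.ToSurrogatePair(0x2A601) gives the code units 0xD869 and 0xDE01, which are the same two Char values you get from char.ConvertFromUtf32(0x2A601).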
Try this yourself:
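// Requires: using System; using System.Globalization;
// U+2A601 lies above 0xFFFF, so it is stored as a surrogate pair (two Char values).
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";
Char[] surrogateArray = surrogate.ToCharArray();
// Reversing Char by Char swaps the two halves of the pair, leaving an invalid sequence.
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);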
or this, if you want to stick with the blog example:
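// U+0301 is the combining acute accent: a combining character, not a surrogate pair.
String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";
Char[] surrogateArray = surrogate.ToCharArray();
// After reversal the accent follows the 'r' instead of the 'e', so it attaches to the wrong letter.
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);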
and then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy but they are absolutely NOT.
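If you want a reversal that keeps both of those cases intact, one possible approach is to reverse by text element instead of by Char, using StringInfo.GetTextElementEnumerator from System.Globalization. This is only a sketch: the class name TextElementReverser is made up for illustration, and what counts as a single text element can differ between .NET versions for more complex sequences.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElementReverser
{
    // Reverses a string one text element at a time, so that surrogate
    // pairs and base-plus-combining-mark sequences are moved as a unit.
    // Hypothetical helper, not part of the framework.
    public static string Reverse(string input)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(input);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }
}
```

With this, reversing "abc" + Char.ConvertFromUtf32(0x2A601) + "def" keeps the surrogate pair in the right order, and in the "Les Mise" + U+0301 + "rables" example the accent stays on the 'e'.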
The simplest way is to use \U######## where the U is capital, and the # signs denote exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:
string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods, such as char.IsHighSurrogate, char.IsLowSurrogate and char.IsSurrogatePair, that will help you determine if a char is a part of a surrogate pair.
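As a rough illustration of those static methods, the sketch below counts code points instead of Char values by stepping over each surrogate pair as one unit. The class and method names (SurrogateInspection, CountCodePoints) are made up for this example.

```csharp
using System;

static class SurrogateInspection
{
    // Counts Unicode code points in a string, treating every valid
    // surrogate pair as a single code point. Hypothetical helper.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate that completes this pair
            }
            count++;
        }
        return count;
    }
}
```

For the string "\U0001F01C", Length is 2 while SurrogateInspection.CountCodePoints returns 1; char.IsHighSurrogate is true for the first Char and char.IsLowSurrogate is true for the second.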
If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:
string fourCircles = char.ConvertFromUtf32(0x1F01C);
Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:
string myString = "In the game of mahjong