Question
I have written this method to reverse a string:
public string Reverse(string s)
{
    if (string.IsNullOrEmpty(s))
        return s;

    TextElementEnumerator enumerator =
        StringInfo.GetTextElementEnumerator(s);
    var elements = new List<char>();
    while (enumerator.MoveNext())
    {
        var cs = enumerator.GetTextElement().ToCharArray();
        if (cs.Length > 1)
        {
            elements.AddRange(cs.Reverse());
        }
        else
        {
            elements.AddRange(cs);
        }
    }
    elements.Reverse();
    return string.Concat(elements);
}
Now, I don't want to start a discussion about how this code could be made more efficient or how there are one liners that I could use instead. I'm aware that you can perform Xors and all sorts of other things to potentially improve this code. If I want to refactor the code later I could do that easily as I have unit tests.
Currently, this correctly reverses BMP strings (including strings with accents like "Les Misérables") and strings that contain combining characters such as "Les Mise\u0301rables".
My tests that contain surrogate pairs work if the pairs are expressed like this:
Assert.AreEqual("𠈓", _stringOperations.Reverse("𠈓"));
But if I express a surrogate pair like this:
Assert.AreEqual("\u10000", _stringOperations.Reverse("\u10000"));
then the test fails. Is there an airtight implementation that supports surrogate pairs as well?
If I have made any mistake above then please do point this out as I'm no Unicode expert.
Answer 1:
\u10000 is a string of two characters: က (Unicode code point U+1000) followed by a 0 (you can verify this by inspecting the value of s in your method). If you reverse those two characters, the result no longer matches the input.
It seems you're after Unicode Character 'LINEAR B SYLLABLE B008 A' (U+10000) with hexadecimal code point 10000. From Unicode character escape sequences on MSDN:
\u hex-digit hex-digit hex-digit hex-digit
\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit
So you'll have to use either four or eight digits.
Use \U00010000 (notice the capital U) or \uD800\uDC00 instead of \u10000.
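To see what each escape actually produces, a quick check along these lines can help. This is a minimal sketch, not part of the original answer; the class and method names (EscapeDemo, Describe) are made up for illustration.

```csharp
using System;
using System.Linq;

public static class EscapeDemo
{
    // Formats each UTF-16 code unit of s as four hex digits, e.g. "D800 DC00".
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // \u takes exactly four hex digits, so the trailing '0' is a separate char.
        Console.WriteLine(Describe("\u10000"));     // 1000 0030
        // \U takes eight hex digits and yields the surrogate pair for U+10000.
        Console.WriteLine(Describe("\U00010000"));  // D800 DC00
        // Writing the two surrogates by hand gives the same string.
        Console.WriteLine("\U00010000" == "\uD800\uDC00");                    // True
        // char.ConvertFromUtf32 builds the same string from the code point.
        Console.WriteLine(char.ConvertFromUtf32(0x10000) == "\uD800\uDC00");  // True
    }
}
```

Both strings have a Length of 2, but only the second one is a single code point, which is why the two tests behave differently.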
Answer 2:
Necromancing.
This happens because you use List<char>.Reverse instead of List<string>.Reverse:
// using System.Globalization;
TextElementEnumerator enumerator =
    StringInfo.GetTextElementEnumerator("Les Mise\u0301rables");
List<string> elements = new List<string>();
while (enumerator.MoveNext())
    elements.Add(enumerator.GetTextElement());
elements.Reverse();
string reversed = string.Concat(elements); // selbarésiM seL
See Jon Skeet's pony video for more information: https://vimeo.com/7403673
Here's how you properly reverse a string (a string, not a sequence of chars):
public static class Test
{
    // Splits s into its text elements (grapheme clusters).
    private static System.Collections.Generic.List<string> GraphemeClusters(string s)
    {
        System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();
        System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            ls.Add((string)enumerator.Current);
        }
        return ls;
    }

    // Reverses the string by grapheme cluster (text element), not by char.
    private static string ReverseGraphemeClusters(string s)
    {
        if (string.IsNullOrEmpty(s) || s.Length == 1)
            return s;
        System.Collections.Generic.List<string> ls = GraphemeClusters(s);
        ls.Reverse();
        return string.Join("", ls.ToArray());
    }

    public static void TestMe()
    {
        string s = "Les Mise\u0301rables";
        string r = ReverseGraphemeClusters(s);
        // This would be wrong:
        // char[] a = s.ToCharArray();
        // System.Array.Reverse(a);
        // string r = new string(a);
        System.Console.WriteLine(r);
    }
}
Note that you need to know the difference between
- a character and a glyph
- a byte (8 bits) and a code point/rune (up to 21 bits, usually stored in 32)
- a code point and a grapheme cluster (one or more code points, aka grapheme)
Reference:
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
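To make these distinctions concrete in .NET terms, the sketch below counts UTF-16 code units, code points, and text elements for the same string. It is an illustration under assumptions, not part of the quoted reference: the class and method names (UnitCounts, CountCodePoints) are invented here, and what StringInfo treats as a single text element can differ between .NET versions for more complex sequences such as emoji.

```csharp
using System;
using System.Globalization;

public static class UnitCounts
{
    // Counts Unicode code points by treating each valid surrogate pair as one.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // 'e' + combining acute accent (U+0301), then U+10000 as a surrogate pair.
        string s = "e\u0301\U00010000";
        Console.WriteLine(s.Length);                                // 4 UTF-16 code units
        Console.WriteLine(CountCodePoints(s));                      // 3 code points
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 2 text elements
    }
}
```

Reversing by char operates on the first count, reversing by code point on the second, and reversing by text element on the third; only the last keeps both the accent and the surrogate pair intact.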
Answer 3:
This is a start. It might not be the fastest, but it does seem to work for what we have thrown at it.
// using System.Text;
internal static string ReverseItWithSurrogate(string stringToReverse)
{
    if (string.IsNullOrEmpty(stringToReverse))
        return stringToReverse;

    // We want to get the string into a character array first.
    char[] stringArray = stringToReverse.ToCharArray();
    // This is the object that will hold our reversed string.
    var sb = new StringBuilder(stringArray.Length);

    // We start at the back and look at each character. If it is a low surrogate
    // and the one before it is a high surrogate, we have a surrogate pair.
    for (int loopVariable = stringArray.Length - 1; loopVariable >= 0; loopVariable--)
    {
        // We can't check for a high surrogate if the low surrogate is at index 0.
        if (loopVariable > 0
            && char.IsLowSurrogate(stringArray[loopVariable])
            && char.IsHighSurrogate(stringArray[loopVariable - 1]))
        {
            // Keep the pair in its original order (high, then low).
            sb.Append(stringArray[loopVariable - 1]);
            sb.Append(stringArray[loopVariable]);
            // Skip the high surrogate, which has already been appended.
            loopVariable--;
        }
        else
        {
            // Any other character, including the one at index 0, is appended as-is.
            sb.Append(stringArray[loopVariable]);
        }
    }
    return sb.ToString();
}
Answer 4:
best viewed NOT in Chrome!
using System.Linq;
using System.Collections.Generic;
using System;
using System.Globalization;
using System.Diagnostics;
using System.Collections;

namespace OrisNumbers
{
    public static class IEnumeratorExtensions
    {
        public static IEnumerable<T> AsIEnumerable<T>(this IEnumerator iterator)
        {
            while (iterator.MoveNext())
            {
                yield return (T)iterator.Current;
            }
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var s = "foo 𝌆 bar mañana mañana";
            Debug.WriteLine(s);
            Debug.WriteLine(string.Join("", StringInfo.GetTextElementEnumerator(s.Normalize()).AsIEnumerable<string>().Reverse()));
            Console.Read();
        }
    }
}
Source: https://stackoverflow.com/questions/22114707/how-to-reverse-a-string-that-contains-surrogate-pairs