How to reverse a string that contains surrogate pairs

问题

I have written this method to reverse a string

public string Reverse(string s)
        {
            if(string.IsNullOrEmpty(s)) 
                return s;

            TextElementEnumerator enumerator =
               StringInfo.GetTextElementEnumerator(s);

            var elements = new List<char>();
            while (enumerator.MoveNext())
            {
                var cs = enumerator.GetTextElement().ToCharArray();
                if (cs.Length > 1)
                {
                    elements.AddRange(cs.Reverse());
                }
                else
                {
                    elements.AddRange(cs);
                }
            }

            elements.Reverse();
            return string.Concat(elements);
        }

Now, I don't want to start a discussion about how this code could be made more efficient or how there are one liners that I could use instead. I'm aware that you can perform Xors and all sorts of other things to potentially improve this code. If I want to refactor the code later I could do that easily as I have unit tests.

Currently, this correctly reverses BML strings (including strings with accents like "Les Misérables") and strings that contain combined characters such as "Les Mise\u0301rables".

My test that contains surrogate pairs work if they are expressed like this

Assert.AreEqual("𠈓", _stringOperations.Reverse("𠈓"));

But if I express surrogate pairs like this

Assert.AreEqual("\u10000", _stringOperations.Reverse("\u10000"));

then the test fails. Is there an air-tight implementation that supports surrogate pairs as well?

If I have made any mistake above then please do point this out as I'm no Unicode expert.

回答1:

\u10000 is a string of two characters: က (Unicode code point 1000) followed by a 0 (which can be detected by inspecting the value of s in your method). If you reverse two characters, they won't match the input anymore.

It seems you're after Unicode Character 'LINEAR B SYLLABLE B008 A' (U+10000) with hexadecimal code point 10000. From Unicode character escape sequences on MSDN:

\u hex-digit hex-digit hex-digit hex-digit

\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

So you'll have to use either four or eight digits.

Use \U00010000 (notice the capital U) or \uD800\uDC00 instead of \u10000.

回答2:

Necromancing.
This happens because you use List<char>.Reverse instead of List<string>.Reverse

// using System.Globalization;

TextElementEnumerator enumerator =
    StringInfo.GetTextElementEnumerator("Les Mise\u0301rables");

List<string> elements = new List<string>();
while (enumerator.MoveNext())
    elements.Add(enumerator.GetTextElement());

elements.Reverse();
string reversed = string.Concat(elements);  // selbarésiM seL

See Jon Skeet's pony video for more information: https://vimeo.com/7403673

Here's how you properly reverse a string (a string, not a sequence of chars):

public static class Test
{

    private static System.Collections.Generic.List<string> GraphemeClusters(string s)
    {
        System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();

        System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            ls.Add((string)enumerator.Current);
        }

        return ls;
    }


    // this 
    private static string ReverseGraphemeClusters(string s)
    {
         if(string.IsNullOrEmpty(s) || s.Length == 1)
              return s;

        System.Collections.Generic.List<string> ls = GraphemeClusters(s);
        ls.Reverse();

        return string.Join("", ls.ToArray());
    }

    public static void TestMe()
    {
        string s = "Les Mise\u0301rables";
        string r = ReverseGraphemeClusters(s);

        // This would be wrong:
        // char[] a = s.ToCharArray();
        // System.Array.Reverse(a);
        // string r = new string(a);

        System.Console.WriteLine(r);
    }
}

Note that you need to know the difference between
- a character and a glyph
- a byte (8 bit) and a codepoint/rune (32 bit)
- a codepoint and a GraphemeCluster [32+ bit] (aka Grapheme/Glyph)

Reference:

Character is an overloaded term than can mean many things.

A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may chose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.

回答3:

This is a start. It might not be the fastest, but it does seem to work for what we have thrown at it.

internal static string ReverseItWithSurrogate(string stringToReverse)
{
    string result = string.Empty;

    // We want to get the string into a character array first
    char[] stringArray = stringToReverse.ToCharArray();

    // This is the object that will hold our reversed string.
    var sb = new StringBuilder();
    bool haveSurrogate = false;

    // We are starting at the back and looking at each character.  if it is a
    // low surrogate and the one prior is a high and not < 0, then we have a surrogate pair.
    for (int loopVariable = stringArray.Length - 1; loopVariable >= 0; loopVariable--)
    {
    // we cant' check the high surrogate if the low surrogate is index 0
    if (loopVariable > 0)
    {
        haveSurrogate = false;

        if (char.IsLowSurrogate(stringArray[loopVariable]) &&    char.IsHighSurrogate(stringArray[loopVariable - 1]))
       {
          sb.Append(stringArray[loopVariable - 1]);
          sb.Append(stringArray[loopVariable]);

         // and force the second character to drop from our loop
         loopVariable--;
         haveSurrogate = true;
       }

      if (!haveSurrogate)
      {
         sb.Append(stringArray[loopVariable]);
        }
       }
    else
    {
     // Now we have to handle the first item in the list if it is not a high surrogate.
      if (!haveSurrogate)
      {
        sb.Append(stringArray[loopVariable]);
       }
     }
   }

result = sb.ToString();
return result;
}

回答4:

best viewed NOT in Chrome!

using System.Linq;
using System.Collections.Generic;
using System;
using System.Globalization;
using System.Diagnostics;
using System.Collections;
namespace OrisNumbers
{
    public static class IEnumeratorExtensions
    {
        public static IEnumerable<T> AsIEnumerable<T>(this IEnumerator iterator)
        {
            while (iterator.MoveNext())
            {
                yield return (T)iterator.Current;
            }
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            var s = "foo 𝌆 bar mañana mañana" ;
            Debug.WriteLine(s);
            Debug.WriteLine(string.Join("", StringInfo.GetTextElementEnumerator(s.Normalize()).AsIEnumerable<string>().Reverse()));
            Console.Read();
        }
    }
}

来源：https://stackoverflow.com/questions/22114707/how-to-reverse-a-string-that-contains-surrogate-pairs

标签

string

reverse

utf-16

surrogate-pairs