How to reverse a string that contains surrogate pairs

泪湿孤枕 提交于 2019-12-01 07:31:18

\u10000 is a string of two characters: က (Unicode code point 1000) followed by a 0 (which can be detected by inspecting the value of s in your method). If you reverse two characters, they won't match the input anymore.

It seems you're after Unicode Character 'LINEAR B SYLLABLE B008 A' (U+10000) with hexadecimal code point 10000. From Unicode character escape sequences on MSDN:

\u hex-digit hex-digit hex-digit hex-digit

\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

So you'll have to use either four or eight digits.

Use \U00010000 (notice the capital U) or \uD800\uDC00 instead of \u10000.

Necromancing.
This happens because you use List<char>.Reverse instead of List<string>.Reverse

// using System.Globalization;

TextElementEnumerator enumerator =
    StringInfo.GetTextElementEnumerator("Les Mise\u0301rables");

List<string> elements = new List<string>();
while (enumerator.MoveNext())
    elements.Add(enumerator.GetTextElement());

elements.Reverse();
string reversed = string.Concat(elements);  // selbarésiM seL

See Jon Skeet's pony video for more information: https://vimeo.com/7403673

Here's how you properly reverse a string (a string, not a sequence of chars):

public static class Test
{

    private static System.Collections.Generic.List<string> GraphemeClusters(string s)
    {
        System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();

        System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            ls.Add((string)enumerator.Current);
        }

        return ls;
    }


    // this 
    private static string ReverseGraphemeClusters(string s)
    {
         if(string.IsNullOrEmpty(s) || s.Length == 1)
              return s;

        System.Collections.Generic.List<string> ls = GraphemeClusters(s);
        ls.Reverse();

        return string.Join("", ls.ToArray());
    }

    public static void TestMe()
    {
        string s = "Les Mise\u0301rables";
        string r = ReverseGraphemeClusters(s);

        // This would be wrong:
        // char[] a = s.ToCharArray();
        // System.Array.Reverse(a);
        // string r = new string(a);

        System.Console.WriteLine(r);
    }
}

Note that you need to know the difference between
- a character and a glyph
- a byte (8 bit) and a codepoint/rune (32 bit)
- a codepoint and a GraphemeCluster [32+ bit] (aka Grapheme/Glyph)

Reference:

Character is an overloaded term than can mean many things.

A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may chose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.

This is a start. It might not be the fastest, but it does seem to work for what we have thrown at it.

internal static string ReverseItWithSurrogate(string stringToReverse)
{
    string result = string.Empty;

    // We want to get the string into a character array first
    char[] stringArray = stringToReverse.ToCharArray();

    // This is the object that will hold our reversed string.
    var sb = new StringBuilder();
    bool haveSurrogate = false;

    // We are starting at the back and looking at each character.  if it is a
    // low surrogate and the one prior is a high and not < 0, then we have a surrogate pair.
    for (int loopVariable = stringArray.Length - 1; loopVariable >= 0; loopVariable--)
    {
    // we cant' check the high surrogate if the low surrogate is index 0
    if (loopVariable > 0)
    {
        haveSurrogate = false;

        if (char.IsLowSurrogate(stringArray[loopVariable]) &&    char.IsHighSurrogate(stringArray[loopVariable - 1]))
       {
          sb.Append(stringArray[loopVariable - 1]);
          sb.Append(stringArray[loopVariable]);

         // and force the second character to drop from our loop
         loopVariable--;
         haveSurrogate = true;
       }

      if (!haveSurrogate)
      {
         sb.Append(stringArray[loopVariable]);
        }
       }
    else
    {
     // Now we have to handle the first item in the list if it is not a high surrogate.
      if (!haveSurrogate)
      {
        sb.Append(stringArray[loopVariable]);
       }
     }
   }

result = sb.ToString();
return result;
}

best viewed NOT in Chrome!

using System.Linq;
using System.Collections.Generic;
using System;
using System.Globalization;
using System.Diagnostics;
using System.Collections;
namespace OrisNumbers
{
    public static class IEnumeratorExtensions
    {
        public static IEnumerable<T> AsIEnumerable<T>(this IEnumerator iterator)
        {
            while (iterator.MoveNext())
            {
                yield return (T)iterator.Current;
            }
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            var s = "foo 𝌆 bar mañana mañana" ;
            Debug.WriteLine(s);
            Debug.WriteLine(string.Join("", StringInfo.GetTextElementEnumerator(s.Normalize()).AsIEnumerable<string>().Reverse()));
            Console.Read();
        }
    }
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!