Strange string.IndexOf behaviour

无人及你 2020-12-16 05:12

I wrote the following snippet to get rid of excessive spaces in slabs of text:

int index = text.IndexOf("  ");
while (index > 0)
{
    text = text         


        
1 Answer
  • 2020-12-16 05:59

    Ah, the joys of text.

    What you most likely have there, but got lost when posting on SO, is a "soft hyphen".

    To reproduce the problem, I tried this code in LINQPad:

    void Main()
    {
        var text = "Test1 \u00ad Test2"; // \u00ad is a soft hyphen (U+00AD) between two single spaces
        int index = text.IndexOf("  ");
        while (index > 0)
        {
            text = text.Replace("  ", " ");
            index = text.IndexOf("  ");
        }
    }
    

    And sure enough, the above code just gets stuck in a loop.

    Note that \u00ad is the Unicode code point for the soft hyphen (U+00AD), according to CharMap. You can always copy and paste the character from CharMap as well, but posting it here on SO replaces it with its much more common cousin, the hyphen-minus, U+002D (the one on your keyboard).
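
Since the soft hyphen is invisible in most editors, one way to check whether a string contains one is to print its code units. This is only a minimal sketch; the class and method names (CodePointDump, Dump) are made up for illustration and are not part of the original post.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Hypothetical helper: renders every UTF-16 code unit of the string as U+XXXX.
    public static string Dump(string text) =>
        string.Join(" ", text.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        // 'a', space, soft hyphen, space, 'b'
        Console.WriteLine(Dump("a \u00ad b"));
        // prints: U+0061 U+0020 U+00AD U+0020 U+0062
    }
}
```

A U+00AD between two U+0020 entries is the situation described in this answer.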

    You can read a small section in the documentation for the String Class which has this to say on the subject:

    String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.

    The relevant part is the sentence about the soft hyphen (U+00AD). I also remember a blog post about this exact problem from a while back, but my Google-Fu is failing me tonight.

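    The problem here is that IndexOf and Replace locate text in different ways: IndexOf(string) performs a culture-sensitive search by default, while Replace(string, string) performs an ordinal one.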

    IndexOf treats the soft hyphen as "not really there", so it sees the two spaces on either side of it as two adjacent spaces. Replace does not, so it removes neither of them. The loop condition therefore stays true, but since Replace never removes the spaces that satisfy it, the loop never ends. Undoubtedly there are other characters in Unicode that cause similar problems, but this is the most typical case I've seen.
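
The mismatch can be seen side by side in a small console program. This is a sketch, not code from the original post: the class and helper names are invented for illustration, and the culture-sensitive result is deliberately left without an expected value because it depends on the runtime and its culture data.

```csharp
using System;

public static class SoftHyphenDemo
{
    // Hypothetical helper: looks for two adjacent spaces by comparing raw UTF-16 code units.
    public static int OrdinalIndexOfDoubleSpace(string text) =>
        text.IndexOf("  ", StringComparison.Ordinal);

    public static void Main()
    {
        // One space, a soft hyphen (U+00AD), one space: no two spaces are truly adjacent.
        string text = "Test1 \u00ad Test2";

        // Culture-sensitive search, which is what IndexOf(string) uses by default.
        // The value printed here depends on the runtime and culture data.
        Console.WriteLine(text.IndexOf("  ", StringComparison.CurrentCulture));

        // Ordinal search: the soft hyphen counts as a real character, so nothing is found.
        Console.WriteLine(OrdinalIndexOfDoubleSpace(text)); // -1

        // Replace(string, string) is ordinal as well, so the string comes back unchanged.
        Console.WriteLine(text.Replace("  ", " ") == text); // True
    }
}
```

If the first line prints a non-negative index while the other two print -1 and True, the loop from the question can never terminate on this input.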

    There are at least two ways of handling this:

    1. You can use Regex.Replace, which does not seem to have this problem:

      text = Regex.Replace(text, "  +", " ");
      

      Personally I would probably use the whitespace special character in the Regular Expression, which is \s, but if you only want spaces, the above should do the trick.

    2. You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:

      index = text.IndexOf("  ", StringComparison.Ordinal);
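
Both options can be put together into a runnable sketch. The class and method names are assumptions made for illustration. Two small deviations from the snippets above are also my own: the regex is written as " {2,}" (equivalent to "  +"), and the loop tests index >= 0 rather than index > 0 so that a double space at the very start of the string is handled too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceCollapser
{
    // Option 1: collapse every run of two or more U+0020 spaces into a single space.
    public static string CollapseWithRegex(string text) =>
        Regex.Replace(text, " {2,}", " ");

    // Option 2: the loop from the question, with the search made ordinal so that it
    // agrees with Replace(string, string), which is always ordinal.
    public static string CollapseWithOrdinalLoop(string text)
    {
        int index = text.IndexOf("  ", StringComparison.Ordinal);
        while (index >= 0) // >= 0 so a match at position 0 is not skipped
        {
            text = text.Replace("  ", " ");
            index = text.IndexOf("  ", StringComparison.Ordinal);
        }
        return text;
    }

    public static void Main()
    {
        // Three real spaces, then space + soft hyphen + space.
        string text = "a   b \u00ad c";

        // Both approaches collapse the real run and leave the spaces around U+00AD alone.
        Console.WriteLine(CollapseWithRegex(text) == "a b \u00ad c");       // True
        Console.WriteLine(CollapseWithOrdinalLoop(text) == "a b \u00ad c"); // True
    }
}
```

Neither version removes the soft hyphen itself; if that is wanted, it has to be stripped separately.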
      