.NET Regex dot character matches carriage return?

南笙 2020-12-31 06:39

Every single flavor of regex I have ever used has always had the "." character match everything but a new line (\r or \n)... unless, of course, you enable the single-lin

5 answers
  •  隐瞒了意图╮
    2020-12-31 07:26

    I think the point here is that the dot is supposed to match anything that's not a line separator, and \r is a line separator. Perl gets away with recognizing only \n because it is (as others have pointed out) rooted in the Unix world, and because it's the inspiration for the regex flavors found in most other languages.

    (But I note that in Perl 6 regexes (or Rules, to use their formal name), /\n/ matches anything that's recognized by Unicode as a line separator, including both characters of a \r\n sequence.)

    .NET was born in the Unicode era; it should recognize all Unicode-endorsed line separators, including \r (older Mac style) and \r\n (which is used by some network protocols as well as Windows). Consider this example in Java:

    String s = "fee\nfie\r\nfoe\rfum";
    Pattern p = Pattern.compile("(?m)^.+$");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group().length());
    }
    

    result:

    3
    3
    3
    3
    

    The dot, ^ and $ all work correctly with all three line separators. Now try it in C#:

    string s = "fee\nfie\r\nfoe\rfum";
    Regex r = new Regex(@"(?m)^.+$");
    foreach (Match m in r.Matches(s))
    {
      Console.WriteLine(m.Value.Length);
    }
    

    result:

    3
    4
    7
    

    Does that look right to anyone else? Here we have the regex flavor built into Microsoft's .NET framework, and it doesn't even handle the Windows-standard line separator correctly. And it completely disregards a lone \r, as it does the other Unicode line separators. .NET came out several years after Java, and its Unicode support is at least as good, so why did they choose to stick on this point?
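
    The same contrast can be reproduced inside Java alone, which may make the difference easier to see. The sketch below relies on Java's documented UNIX_LINES flag (embedded as (?d)), under which only \n is treated as a line terminator by ., ^ and $. The class name UnixLinesDemo and the helper matchLengths are made up for this illustration; it shows how the Java engine behaves under that flag, not how .NET is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: names are illustrative, not from the original answer.
class UnixLinesDemo {
    // Returns the length of every match of regex in input.
    static List<Integer> matchLengths(String regex, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lengths.add(m.group().length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        String s = "fee\nfie\r\nfoe\rfum";
        // Default Java behavior: \n, \r\n and \r all end a line.
        System.out.println(matchLengths("(?m)^.+$", s));   // [3, 3, 3, 3]
        // UNIX_LINES (?d): only \n ends a line, so \r is matched by the dot.
        System.out.println(matchLengths("(?dm)^.+$", s));  // [3, 4, 7]
    }
}
```

    With (?d) enabled, Java produces the same 3, 4, 7 as the C# example above, because the \r characters are then consumed by the dot instead of ending the line.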
