.NET Regex dot character matches carriage return?

南笙 2020-12-31 06:39

Every single flavor of regex I have ever used has always had the "." character match everything but a new line (\r or \n)... unless, of course, you enable the single-lin

5 Answers
  • 2020-12-31 07:08

    Well, I don't think that "there is something rotten in the state of Redmond!"; at least, your scenario doesn't prove it. I think this behavior is a feature rather than a bug. Why? Because Perl regexes behave the same way (I just checked), and I believe PHP's PCRE (Perl Compatible Regular Expressions) functions do too. MS simply made its Regex methods behave like the de facto standard, classic Perl. So now my question is: "what's wrong in the JS kingdom?" :)

  • 2020-12-31 07:12

    I ran into this same issue when writing Regex Hero. It is a little bizarre. I blogged about the issue here. And that led to me adding a feature to the tester to enable/disable CRLFs. Anyway, for some reason Microsoft chose to treat only \n (line feed) as the line ending.

    (UPDATE) The reason must be related to this:

    Microsoft .NET Framework regular expressions incorporate the most popular features of other regular expression implementations such as those in Perl and awk. Designed to be compatible with Perl 5 regular expressions, .NET Framework regular expressions include features not yet seen in other implementations, such as right-to-left matching and on-the-fly compilation. http://msdn.microsoft.com/en-us/library/hs600312.aspx

    And as Igor noted, Perl has the same behavior.

    Now, the Singleline and Multiline RegexOptions change behavior based around dots and line feeds. You can enable the Singleline RegexOption so that the dot matches line feeds. And you can enable the Multiline RegexOption so that ^ and $ mark the beginning and end of every line (denoted by line feeds). But you can't change the inherent behavior of the dot (.) operator to match everything except for \r\n.
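
    One workaround that does not depend on the dot at all is a negated character class such as [^\r\n]. Here is a minimal sketch in Java rather than C#, assuming Java's UNIX_LINES flag, written (?d), is a fair stand-in for the Perl/.NET rule that only \n ends a line. The class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Returns the length of every match of `regex` in `input`, in order.
    static List<Integer> matchLengths(String regex, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lengths.add(m.group().length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        String s = "fee\nfie\r\nfoe\rfum";

        // (?d) enables UNIX_LINES: only \n is a line terminator,
        // so the dot also consumes \r (assumed here to mirror .NET).
        System.out.println(matchLengths("(?d).+", s));     // [3, 4, 7]

        // The negated class stops at either \r or \n, whatever the mode.
        System.out.println(matchLengths("[^\\r\\n]+", s)); // [3, 3, 3, 3]
    }
}
```

    The character class spells out exactly which characters end a run, so it should behave the same in any flavor that supports the \r and \n escapes.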

  • 2020-12-31 07:14

    Regular expressions have their practical (as opposed to theoretical) origin in the Unix environment, where LF is the line terminator, so it seems completely appropriate for . to match everything except LF.

    The dot is a single-character match, so matching CRLF would be too much to ask, and matching either CR or LF might cause problems when migrating regexes across platforms. I think \s is a better approach for whitespace matching, and it matches both CR and LF.

  • 2020-12-31 07:26

    I think the point here is that the dot is supposed to match anything that's not a line separator, and \r is a line separator. Perl gets away with recognizing only \n because it is (as others have pointed out) rooted in the Unix world, and because it's the inspiration for the regex flavors found in most other languages.

    (But I note that in Perl 6 regexes (or Rules, to use their formal name), /\n/ matches anything that's recognized by Unicode as a line separator, including both characters of a \r\n sequence.)

    .NET was born in the Unicode era; it should recognize all Unicode-endorsed line separators, including \r (older Mac style) and \r\n (which is used by some network protocols as well as Windows). Consider this example in Java:

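    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Three line-separator styles: \n, \r\n and a lone \r
    String s = "fee\nfie\r\nfoe\rfum";
    Pattern p = Pattern.compile("(?m)^.+$");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group().length());
    }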
    

    result:

    3
    3
    3
    3
    

    The dot, ^ and $ all work correctly with all three line separators. Now try it in C#:

    using System;
    using System.Text.RegularExpressions;

    // Same input and pattern as the Java version
    string s = "fee\nfie\r\nfoe\rfum";
    Regex r = new Regex(@"(?m)^.+$");
    foreach (Match m in r.Matches(s))
    {
      Console.WriteLine(m.Value.Length);
    }
    

    result:

    3
    4
    7
    

    Does that look right to anyone else? Here we have the regex flavor built into Microsoft's .NET framework, and it doesn't even handle the Windows-standard line separator correctly. And it completely disregards a lone \r, as it does the other Unicode line separators. .NET came out several years after Java, and its Unicode support is at least as good, so why did they choose to stick on this point?
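
    The gap between the two results comes down to which characters each engine counts as line terminators. As a rough check, and assuming Java's UNIX_LINES flag, written (?d), mirrors the \n-only rule, the same Java pattern can be made to reproduce the C# numbers. The class and helper names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminatorDemo {
    // Collects the length of each match of `regex` in `input`.
    static List<Integer> lineLengths(String regex, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lengths.add(m.group().length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        String s = "fee\nfie\r\nfoe\rfum";

        // Default Java: \n, \r\n and a lone \r all end a line.
        System.out.println(lineLengths("(?m)^.+$", s));  // [3, 3, 3, 3]

        // UNIX_LINES added: only \n ends a line, so \r is swallowed by the dot.
        System.out.println(lineLengths("(?md)^.+$", s)); // [3, 4, 7]
    }
}
```

    With the flag on, the 4 and the 7 are "fie\r" and "foe\rfum": the carriage return is treated as ordinary content of the line.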

  • 2020-12-31 07:29

    Except in Singleline mode, . will match every character except \n.
    As you've noticed, it does match \r.

    I don't know why.
