Why doesn't finite repetition in lookbehind work in some flavors?

臣服心动 2020-12-10 21:11

I want to parse the 2 digits in the middle from a date in dd/mm/yy format but also allowing single digits for day and month.

This is what I came up with

4 answers
  • 2020-12-10 21:47

    Unless there's a specific reason for using the lookbehind which isn't noted in the question, how about simply matching the whole thing and only capturing the bit you're interested in instead?

    JavaScript example:

    >>> /^\d{1,2}\/(\d{1,2})\/\d{1,2}$/.exec("12/12/12")[1]
    "12"
    
  • 2020-12-10 21:48

    In addition to those listed by @polygenelubricants, there are two more exceptions to the "fixed length only" rule. In PCRE (the regex engine for PHP, Apache, et al.) and Oniguruma (Ruby 1.9, TextMate), a lookbehind may consist of an alternation in which each alternative matches a different number of characters, as long as the length of each alternative is fixed. For example:

    (?<=\b\d\d/|\b\d/)\d{1,2}(?=/\d{2}\b)
    

    Note that the alternation has to be at the top level of the lookbehind subexpression. You might, like me, be tempted to factor out the common elements, like this:

    (?<=\b(?:\d\d|\d)/)\d{1,2}(?=/\d{2}\b)
    

    ...but it wouldn't work; at the top level, the subexpression now consists of a single alternative with a non-fixed length.

    The second exception is much more useful: \K, supported by Perl and PCRE. It effectively means "pretend the match really started here." Whatever appears before it in the regex is treated as a positive lookbehind. As with .NET lookbehinds, there are no restrictions; whatever can appear in a normal regex can be used before the \K.

    \b\d{1,2}/\K\d{1,2}(?=/\d{2}\b)
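
    Python's built-in re module does not support \K, but the same "consume the prefix, report only what follows" effect can be approximated with a capturing group. A minimal sketch, assuming the dd/mm/yy input from the question; middle_fields is a hypothetical helper name, not a library function:

```python
import re

# Python's re module has no \K, so a capturing group stands in for it:
# the prefix is consumed by the match, and only group 1 is reported.
pattern = re.compile(r'\b\d{1,2}/(\d{1,2})(?=/\d{2}\b)')

def middle_fields(text):
    """Return (start, end, value) for each middle field found in text."""
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(text)]

print(middle_fields("due 1/23/45 or 12/34/56"))
# [(6, 8, '23'), (18, 20, '34')]
```

    The span of group 1 plays the role of the match boundaries that \K would have produced.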
    

    But most of the time, when someone has a problem with lookbehinds, it turns out they shouldn't even be using them. As @insin pointed out, this problem can be solved much more easily by using a capturing group.

    EDIT: Almost forgot JGSoft, the regex flavor used by EditPad Pro and PowerGREP; like .NET, it has completely unrestricted lookbehinds, positive and negative.

  • 2020-12-10 21:51

    On lookbehind support

    Major regex flavors vary in their support for lookbehind; some impose restrictions, and some don't support it at all.

    • JavaScript: not supported before ES2018; no restriction since
    • Python: fixed length only
    • Java: finite length only
    • .NET: no restriction
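
    For the Python entry in the list above, the restriction can be probed directly by checking which lookbehinds the built-in re module agrees to compile. A small sketch; lookbehind_accepted is a hypothetical helper, and the candidate patterns are only illustrations:

```python
import re

def lookbehind_accepted(pattern):
    """Return True if Python's re module compiles the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

candidates = [
    r'(?<=\d\d/)\d{1,2}',       # fixed width
    r'(?<=ab|cd)x',             # alternatives of equal width
    r'(?<=\d\d/|\d/)\d{1,2}',   # alternatives of different widths
    r'(?<=\d{1,2}/)\d{1,2}',    # bounded repetition
    r'(?<=\d+/)\d{1,2}',        # unbounded repetition
]
results = [lookbehind_accepted(c) for c in candidates]
for c, ok in zip(candidates, results):
    print(c, ok)
```

    Only the first two candidates compile; the rest fail with a "fixed-width" error.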

    References

    • regular-expressions.info/Flavor comparison

    Python

    In Python, where only fixed length lookbehind is supported, your original pattern raises an error because \d{1,2} obviously does not have a fixed length. You can "fix" this by alternating on two different fixed-length lookbehinds, e.g. something like this:

    (?<=^\d\/)\d{1,2}|(?<=^\d\d\/)\d{1,2}
    

    Or perhaps you can put both lookbehinds as alternates of a non-capturing group:

    (?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}
    

    (note that you can just use \d without the brackets).

    That said, it's probably much simpler to use a capturing group instead:

    ^\d{1,2}\/(\d{1,2})
    

    Note that findall returns what group 1 captures if the pattern has exactly one group. Capturing groups are more widely supported than lookbehind, and often lead to a more readable pattern (as in this case).

    This snippet illustrates all of the above points:

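    import re
    
    # Two fixed-width lookbehinds as alternatives
    p = re.compile(r'(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}')
    
    print(p.findall("12/34/56"))   # ['34']
    print(p.findall("1/23/45"))    # ['23']
    
    # Capturing group instead of lookbehind
    p = re.compile(r'^\d{1,2}\/(\d{1,2})')
    
    print(p.findall("12/34/56"))   # ['34']
    print(p.findall("1/23/45"))    # ['23']
    
    # Variable-width lookbehind is rejected at compile time
    try:
        p = re.compile(r'(?<=^\d{1,2}\/)\d{1,2}')
    except re.error as e:
        print(e)                   # look-behind requires fixed-width pattern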
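
    The findall behavior mentioned above changes once a pattern has more than one group. A short sketch of the difference, assuming well-formed dd/mm/yy input; month_of is a hypothetical helper name:

```python
import re

# One group: findall returns a list of that group's text.
one_group = re.compile(r'^\d{1,2}/(\d{1,2})')
# Three groups: findall returns a list of tuples, one item per group.
three_groups = re.compile(r'^(\d{1,2})/(\d{1,2})/(\d{2})$')

def month_of(date):
    """Return the middle field as a string, or None if the input doesn't match."""
    m = three_groups.match(date)
    return m.group(2) if m else None

print(one_group.findall("1/23/45"))      # ['23']
print(three_groups.findall("1/23/45"))   # [('1', '23', '45')]
print(month_of("12/34/56"), month_of("not a date"))   # 34 None
```

    Using match and group(2) avoids depending on how findall shapes its result.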
    

    References

    • regular-expressions.info/Lookarounds, Character classes, Alternation, Capturing groups

    Java

    Java supports only finite-length lookbehind, so you can use \d{1,2} like in the original pattern. This is demonstrated by the following snippet:

        // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
        String text =
            "12/34/56 date\n" +
            "1/23/45 another date\n";
    
        Pattern p = Pattern.compile("(?m)(?<=^\\d{1,2}/)\\d{1,2}");
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        } // "34", "23"
    

    Note that (?m) is the embedded Pattern.MULTILINE so that ^ matches the start of every line. Note also that since \ is an escape character for string literals, you must write "\\" to get one backslash in Java.


    C-Sharp

    C# (.NET) places no restriction on the regex inside a lookbehind. The following snippet shows how you can use + repetition in a lookbehind:

    var text = @"
    1/23/45
    12/34/56
    123/45/67
    1234/56/78
    ";
    
    Regex r = new Regex(@"(?m)(?<=^\d+/)\d{1,2}");
    foreach (Match m in r.Matches(text)) {
      Console.WriteLine(m);
    } // "23", "34", "45", "56"
    

    Note that unlike Java, C# lets you use an @-quoted (verbatim) string so that you don't have to escape \.

    For completeness, here's how you'd use the capturing group option in C#:

    Regex r = new Regex(@"(?m)^\d+/(\d{1,2})");
    foreach (Match m in r.Matches(text)) {
      Console.WriteLine("Matched [" + m + "]; month = " + m.Groups[1]);
    }
    

    Given the previous text, this prints:

    Matched [1/23]; month = 23
    Matched [12/34]; month = 34
    Matched [123/45]; month = 45
    Matched [1234/56]; month = 56
    

    Related questions

    • How can I match on, but exclude a regex pattern?
  • 2020-12-10 21:57

    To quote regular-expressions.info:

    The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

    Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

    In other words, your regex does not work because you're using a variable-width expression inside a lookbehind and your regex engine does not support that.
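
    One way to see why a bounded repetition is still manageable: it can be expanded into a finite set of fixed-width lookbehinds, each with a known step-back distance. A minimal Python sketch of that expansion; expand_lookbehind is a hypothetical helper written for illustration, not something the re module provides:

```python
import re

def expand_lookbehind(atom, min_n, max_n, suffix, body):
    """Build one fixed-width lookbehind per repetition count and join them."""
    lookbehinds = [
        '(?<=^' + atom * n + suffix + ')'
        for n in range(min_n, max_n + 1)
    ]
    return '(?:' + '|'.join(lookbehinds) + ')' + body

# Stands in for the rejected (?<=^\d{1,2}/)\d{1,2}
pattern = expand_lookbehind(r'\d', 1, 2, '/', r'\d{1,2}')
print(pattern)                            # (?:(?<=^\d/)|(?<=^\d\d/))\d{1,2}
print(re.findall(pattern, "1/23/45"))     # ['23']
print(re.findall(pattern, "12/34/56"))    # ['34']
```

    An unbounded repetition such as \d+ has no finite expansion of this kind, which is why engines with this restriction reject it.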
