Regex: Matching by exclusion, without look-ahead - is it possible?

既然无缘 2020-11-28 11:03

In some regex flavors, [negative] zero-width assertions (look-ahead/look-behind) are not supported.

This makes it extremely difficult (impossible?) to state an excl

4 answers
  • 2020-11-28 11:31

    UPDATE: The regex below fails when two "f"s come before "oo" (for example, it wrongly accepts "ffoo"), as @Ciantic pointed out in the comments.


    ^(f(o[^o]|[^o])|[^f])*$
    

    NOTE: It is much easier to negate a match on the client side than to use the above regex.

    The regex assumes that each line ends with a newline character; if it does not, see the C++ and grep variants of the regex.

    Sample programs in Perl, Python, C++, and grep all give the same output.

    • perl

      #!/usr/bin/perl -wn
      print if /^(f(o[^o]|[^o])|[^f])*$/;
      
    • python

      #!/usr/bin/env python3
      import fileinput, re, sys
      
      # Python 3: the built-in filter() replaces itertools.ifilter from Python 2.
      re_not_foo = re.compile(r"^(f(o[^o]|[^o])|[^f])*$")
      for line in filter(re_not_foo.match, fileinput.input()):
          sys.stdout.write(line)
      
    • c++

      #include <iostream>
      #include <string>
      #include <boost/regex.hpp>
      
      int main()
      {
        boost::regex re("^(f(o([^o]|$)|([^o]|$))|[^f])*$");
        // NOTE: the "|$" alternatives are needed because getline() strips the newline character
      
        std::string line;
        while (std::getline(std::cin, line)) 
          if (boost::regex_match(line, re))
            std::cout << line << std::endl;
      }
      
    • grep

      $ grep "^\(f\(o\([^o]\|$\)\|\([^o]\|$\)\)\|[^f]\)*$" in.txt
      

    Sample file:

    foo
    'foo'
    abdfoode
    abdfode
    abdfde
    abcde
    f
    
    fo
    foo
    fooo
    ofooa
    ofo
    ofoo
    

    Output:

    abdfode
    abdfde
    abcde
    f
    
    fo
    ofo
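
    The failure described in the UPDATE can be reproduced with a short check. The sketch below is not part of the original answer: it compares the regex against a plain substring test, and the helper names (`regex_says_no_foo`, `really_has_no_foo`, `disagreements`) and the sample strings are made up for illustration.

```python
import re

# The regex from this answer, unchanged.
RE_NOT_FOO = re.compile(r"^(f(o[^o]|[^o])|[^f])*$")


def regex_says_no_foo(line: str) -> bool:
    """True if the regex accepts the line, i.e. claims it contains no 'foo'."""
    return RE_NOT_FOO.match(line) is not None


def really_has_no_foo(line: str) -> bool:
    """Ground truth: a plain substring test."""
    return "foo" not in line


def disagreements(lines):
    """Return the lines on which the regex and the substring test disagree."""
    return [ln for ln in lines if regex_says_no_foo(ln) != really_has_no_foo(ln)]


if __name__ == "__main__":
    # Hypothetical inputs; each ends with a newline, as the regex assumes.
    samples = ["foo\n", "abdfode\n", "fo\n", "ofoo\n", "ffoo\n", "abffoo\n"]
    for ln in disagreements(samples):
        print("regex wrongly accepts:", repr(ln))
```

    With these inputs, only the lines where "ff" precedes "oo" are reported, because `f[^o]` consumes the first "f" together with the second one and the remaining "oo" is then matched by `[^f]`.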
    
  • 2020-11-28 11:36

    You can usually look for foo and invert the result of the regex match from the client code.

    For a simple example, let's say you want to validate that a string contains only certain characters.

    You could write that like this:

    ^[A-Za-z0-9.$-]*$

    and accept a true result as valid, or like this:

    [^A-Za-z0-9.$-]

    and accept a false result as valid.

    Of course, this isn't always an option: sometimes you have to put the expression in a config file or pass it to another program. But it's worth remembering. For your specific problem, the expression is much simpler if you can use negation like this.
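
    As a rough illustration of the two validation styles above, here is a minimal Python sketch. The function names `is_valid_positive` and `is_valid_negated` are hypothetical; only the two patterns come from this answer.

```python
import re

# Positive form: the whole string must consist of allowed characters.
ONLY_ALLOWED = re.compile(r"^[A-Za-z0-9.$-]*$")
# Negated form: search for a single disallowed character instead.
ANY_DISALLOWED = re.compile(r"[^A-Za-z0-9.$-]")


def is_valid_positive(s: str) -> bool:
    """Accept a true result (a match) as valid."""
    return ONLY_ALLOWED.search(s) is not None


def is_valid_negated(s: str) -> bool:
    """Accept a false result (no match) as valid."""
    return ANY_DISALLOWED.search(s) is None


if __name__ == "__main__":
    for candidate in ["abc-1.$", "a b", ""]:
        print(repr(candidate), is_valid_positive(candidate), is_valid_negated(candidate))
```

    One caveat for Python specifically: `$` also matches just before a trailing newline, so the two forms can disagree on a string that ends with "\n". The negated form treats the newline as a disallowed character, while the positive form does not.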

  • 2020-11-28 11:40

    Came across this question and took the fact that there wasn't a fully working regex as a personal challenge. I believe I've managed to create a regex that does work for all inputs, provided you can use atomic grouping or possessive quantifiers.

    Of course, I'm not sure whether any flavours allow atomic grouping but not lookaround, but the question asked whether it's possible in regex to state an exclusion without lookaround, and it is technically possible:

    \A(?:$|[^f]++|f++(?:[^o]|$)|(?:f++o)*+(?:[^o]|$))*\Z
    

    Explanation:

    \A                         #Start of string
    (?:                        #Non-capturing group
        $                      #Consume end-of-line. We're not in foo-mode.
        |[^f]++                #Consume every non-'f'. We're not in foo-mode.
        |f++(?:[^o]|$)         #Enter foo-mode with an 'f'. Consume all 'f's, but only exit foo-mode if 'o' is not the next character. Thus, 'f' is valid but 'fo' is invalid.
        |(?:f++o)*+(?:[^o]|$)  #Enter foo-mode with an 'f'. Consume all 'f's, followed by a single 'o'. Repeat, since '(f+o)*' by itself cannot contain 'foo'. Only exit foo-mode if 'o' is not the next character following (f+o). Thus, 'fo' is valid but 'foo' is invalid.
    )*                         #Repeat the non-capturing group
    \Z                         #End of string. Note that this regex only works in flavours that can match $\Z
    

    If, for whatever reason, you can use atomic grouping but not possessive quantifiers nor lookaround, you can use:

    \A(?:$|(?>[^f]+)|(?>f+)(?:[^o]|$)|(?>(?:(?>f+)o)*)(?:[^o]|$))*\Z
    

    As others point out, though, it's probably more practical to just negate a match through other means.
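
    The "foo-mode" bookkeeping in the explanation above does not depend on any regex feature, so one of those other means is a hand-written scan. The following is a minimal sketch of that idea as a three-state machine, not a translation of the regex; the function name `contains_no_foo` and the state numbering are assumptions made for illustration.

```python
def contains_no_foo(text: str) -> bool:
    """Scan left to right, tracking how much of 'foo' has just been seen.

    state 0: not in foo-mode
    state 1: the previous character was 'f'
    state 2: the previous two characters were 'fo'
    """
    state = 0
    for ch in text:
        if ch == "f":
            state = 1          # enter (or stay in) foo-mode; runs of 'f' stay here
        elif ch == "o" and state == 1:
            state = 2          # 'fo' seen so far
        elif ch == "o" and state == 2:
            return False       # 'foo' completed
        else:
            state = 0          # any other character leaves foo-mode
    return True


if __name__ == "__main__":
    for sample in ["f", "fo", "foo", "ffoo", "fofo", "fofoo", "abdfode"]:
        print(repr(sample), contains_no_foo(sample))
```

    Unlike the regex, this needs no backtracking control: an "f" always resets the scan to state 1, which is the job the possessive `f++` does in the pattern.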

  • 2020-11-28 11:42

    I stumbled across this question looking for my own regex exclusion solution, where I am trying to exclude a sequence within my regex.

    My initial reaction to this situation (for example, "every line that does not have 'foo' on it") was simply to use grep's -v option, which inverts the sense of matching.

    grep -v foo
    

    This returns all lines in a file that don't match 'foo'.

    It's so simple I have the strong feeling I've just misread your question....
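
    When grep itself is not available, the same inversion is easy to do in client code. Here is a small Python sketch of the idea; the function name `grep_v` is hypothetical, and the input is the sample file from the first answer.

```python
import re


def grep_v(pattern: str, lines):
    """Return the lines that do NOT contain a match for pattern, like `grep -v`."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line) is None]


# The sample file from the first answer, one list item per line.
SAMPLE = ["foo", "'foo'", "abdfoode", "abdfode", "abdfde", "abcde", "f",
          "", "fo", "foo", "fooo", "ofooa", "ofo", "ofoo"]

if __name__ == "__main__":
    for line in grep_v("foo", SAMPLE):
        print(line)
```

    On that sample this keeps the same seven lines listed as the output in the first answer.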
