How to tell if one regular expression matches a subset of another regular expression?

前端 未结 7 2230
孤独总比滥情好
孤独总比滥情好 2020-12-01 14:23

I\'m just wondering if it\'s possible to use one regular expression to match another, that is some sort of:

[\'a-z\'].match([\'b-x\'])
True

[\'m-n\'].match(         


        
相关标签:
7条回答
  • 2020-12-01 14:55

    Some clarification is required, I think:

    .

    rA = re.compile('(?<! )[a-z52]+')
    

    '(?<! )[a-z52]+' is a pattern

    rA is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* .

    Personally I use the term regex exclusively for this kind of objects, not for the pattern .

    Note that rB = re.compile(rA) is also possible, it produces the same object (id(rA) == id(rB) equals to True)


    ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
    x = rA.match(ch) 
    # or
    y = rA.search(ch)
    

    x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*


    .

    That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.

    I don't think such a way exists, whatever theoretically or practically.

    0 讨论(0)
  • 2020-12-01 14:57

    I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:

    1. Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
    2. Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.

    However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.

    But what are you trying to do? Perhaps there's an easier approach...?

    0 讨论(0)
  • 2020-12-01 15:07

    Verification of the post by "antinome" using two regex : 55* and 5* :

    REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]

    REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]

    [Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]

    REGEX_A can have an NFA as below:

      {A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
               {B} -----------------epsilon --------> {E} 
                              {C} <--- epsilon ------ {E}
    

    REGEX_B can have an NFA as below:

      {A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
      {A} --------------epsilon -----------> {D} 
                     {B} <--- epsilon ------ {D}
    

    Now we can derive NFA * DFA of (REGEX_A|REGEX_B) as below:

      NFA:
      {state A}  ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
                                                  {state C} ---epsilon --> {state D} 
                                                  {state C} <---epsilon -- {state D}
      {state A}  ---epsilon --> {state E} ---5--> {state F}
                                {state E} ---epsilon --> {state F} 
                                {state E} <---epsilon -- {state F}
    
      NFA -> DFA:
    
           |   5          |  epsilon*
       ----+--------------+--------
        A  |  B,C,E,F,G   |   A,C,E,F
        B  |  C,D,E,F     |   B,C,E,F
        c  |  C,D,E,F     |   C
        D  |  C,D,E,F,G   |   C,D,E,F
        E  |  C,D,E,F,G   |   C,E,F
        F  |  C,E,F,G     |   F
        G  |  C,D,E,G     |   C,E,F,G
    
                        5(epsilon*)
        -------------+---------------------
                  A  |  B,C,E,F,G 
          B,C,E,F,G  |  C,D,E,F,G 
          C,D,E,F,G  |  C,D,E,F,G 
    
        Finally the DFA for (REGEX_A|REGEX_B) is:
             {A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
                                         {C,D,E,F,G}---5--> {C,D,E,F,G}
    
             Note: {A} is start state and {C,D,E,F,G} is accepting state. 
    

    Similarly DFA for REGEX_A (55*) is:

           |   5    |  epsilon*
       ----+--------+--------
        A  | B,C,E  |   A
        B  | C,D,E  |   B,C,E
        C  | C,D,E  |   C
        D  | C,D,E  |   C,D,E
        E  | C,D,E  |   C,E
    
    
                5(epsilon*)
       -------+---------------------
           A  |  B,C,E  
       B,C,E  |  C,D,E
       C,D,E  |  C,D,E
    
        {A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
                                        {C,D,E}--5--->{C,D,E}
    Note: {A} is start state and {C,D,E} is accepting state
    

    Similarly DFA for REGEX_B (5*) is:

           |   5    |  epsilon*
       ----+--------+--------
        A  | B,C,D  |   A,B,D
        B  | B,C,D  |   B
        C  | B,C,D  |   B,C,D
        D  | B,C,D  |   B,D
    
    
                5(epsilon*)
       -------+---------------------
           A  |  B,C,D  
       B,C,D  |  B,C,D
    
        {A} ---- 5 -----> {B,C,D}
                          {B,C,D} --- 5 ---> {B,C,D}
    Note: {A} is start state and {B,C,D} is accepting state
    

    Conclusions:

    DFA of REGX_A|REGX_B identical to DFA of REGX_A 
          -- implies REGEX_A is subset of REGEX_B
    DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B 
          -- cannot infer about either gerexes.
    
    0 讨论(0)
  • 2020-12-01 15:10

    It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.

    Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.

    0 讨论(0)
  • 2020-12-01 15:11

    You should do something along these lines:

    re.match("\[[b-x]-[b-x]\]", "[a-z]")
    

    The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.

    If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.

    0 讨论(0)
  • 2020-12-01 15:18
    pip3 install https://github.com/leafstorm/lexington/archive/master.zip
    python3
    >>> from lexington.regex import Regex as R
    >>> from lexington.regex import Null
    >>> from functools import reduce
    >>> from string import ascii_lowercase, digits
    >>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
    >>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
    >>> a_z | b_x == a_z
    True
    >>> m_n = R("m") | R("n")
    >>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
    >>> m_n | zero_nine == m_n
    False
    

    Also tested successfully with Python 2. See also how to do it with a different library.

    Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip and:

    from greenery.lego import parse as p
    a_z = p("[a-z]")
    b_x = p("[b-x]")
    assert a_z | b_x == a_z
    m_n = p("m|n")
    zero_nine = p("[0-9]")
    assert not m_n | zero_nine == m_n
    
    0 讨论(0)
提交回复
热议问题