How to tell if one regular expression matches a subset of another regular expression?

前端未结

关注

 7  2237

I\'m just wondering if it\'s possible to use one regular expression to match another, that is some sort of:

[\'a-z\'].match([\'b-x\'])
True

[\'m-n\'].match(


                      
              相关标签:


      
      
        
          7条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  终归单人心        
                
              
                            
                2020-12-01 14:55
              
            
            
                                                                       
Some clarification is required, I think:

.

rA = re.compile('(?<! )[a-z52]+')


'(?<! )[a-z52]+' is a pattern

rA is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* . 

Personally I use the term regex exclusively for this kind of objects, not for the pattern . 

Note that rB = re.compile(rA) is also possible, it produces the same object (id(rA) == id(rB) equals to True)



ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch) 
# or
y = rA.search(ch)


x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*



.

That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.

I don't think such a way exists, whatever theoretically or practically.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  有刺的猬        
                
              
                            
                2020-12-01 14:57
              
            
            
                                                                       
I think — in theory — to tell whether regexp A matches a subset of what regexp B matches, an algorithm could:


Compute the minimal Deterministic Finite Automaton of B and also of the "union" A|B.
Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.


However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.

But what are you trying to do? Perhaps there's an easier approach...?
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别那么骄傲        
                
              
                            
                2020-12-01 15:07
              
            
            
                                                                       
Verification of the post by "antinome" using two regex : 55*  and 5* :

REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]

REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]

[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]

REGEX_A can have an NFA as below:

  {A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
           {B} -----------------epsilon --------> {E} 
                          {C} <--- epsilon ------ {E}


REGEX_B can have an NFA as below:

  {A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
  {A} --------------epsilon -----------> {D} 
                 {B} <--- epsilon ------ {D}


Now we can derive NFA * DFA of (REGEX_A|REGEX_B)  as below:

  NFA:
  {state A}  ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
                                              {state C} ---epsilon --> {state D} 
                                              {state C} <---epsilon -- {state D}
  {state A}  ---epsilon --> {state E} ---5--> {state F}
                            {state E} ---epsilon --> {state F} 
                            {state E} <---epsilon -- {state F}

  NFA -> DFA:

       |   5          |  epsilon*
   ----+--------------+--------
    A  |  B,C,E,F,G   |   A,C,E,F
    B  |  C,D,E,F     |   B,C,E,F
    c  |  C,D,E,F     |   C
    D  |  C,D,E,F,G   |   C,D,E,F
    E  |  C,D,E,F,G   |   C,E,F
    F  |  C,E,F,G     |   F
    G  |  C,D,E,G     |   C,E,F,G

                    5(epsilon*)
    -------------+---------------------
              A  |  B,C,E,F,G 
      B,C,E,F,G  |  C,D,E,F,G 
      C,D,E,F,G  |  C,D,E,F,G 

    Finally the DFA for (REGEX_A|REGEX_B) is:
         {A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
                                     {C,D,E,F,G}---5--> {C,D,E,F,G}

         Note: {A} is start state and {C,D,E,F,G} is accepting state. 


Similarly DFA for REGEX_A (55*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,E  |   A
    B  | C,D,E  |   B,C,E
    C  | C,D,E  |   C
    D  | C,D,E  |   C,D,E
    E  | C,D,E  |   C,E


            5(epsilon*)
   -------+---------------------
       A  |  B,C,E  
   B,C,E  |  C,D,E
   C,D,E  |  C,D,E

    {A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
                                    {C,D,E}--5--->{C,D,E}
Note: {A} is start state and {C,D,E} is accepting state


Similarly DFA for REGEX_B (5*) is:

       |   5    |  epsilon*
   ----+--------+--------
    A  | B,C,D  |   A,B,D
    B  | B,C,D  |   B
    C  | B,C,D  |   B,C,D
    D  | B,C,D  |   B,D


            5(epsilon*)
   -------+---------------------
       A  |  B,C,D  
   B,C,D  |  B,C,D

    {A} ---- 5 -----> {B,C,D}
                      {B,C,D} --- 5 ---> {B,C,D}
Note: {A} is start state and {B,C,D} is accepting state


Conclusions:

DFA of REGX_A|REGX_B identical to DFA of REGX_A 
      -- implies REGEX_A is subset of REGEX_B
DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B 
      -- cannot infer about either gerexes.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  旧时难觅i        
                
              
                            
                2020-12-01 15:10
              
            
            
                                                                       
It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile. I don't see what use this would be, though. Also, it takes a different syntax.

Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re module doesn't do it.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉梦人生        
                
              
                            
                2020-12-01 15:11
              
            
            
                                                                       
You should do something along these lines:

re.match("\[[b-x]-[b-x]\]", "[a-z]")


The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.

If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  旧巷少年郎        
                
              
                            
                2020-12-01 15:18
              
            
            
                                                                       
pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False


Also tested successfully with Python 2. See also how to do it with a different library.

Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip and:

from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复