Regular Expressions in Python unexpectedly slow

后端未结

关注

 4  1734

迷失自我 2021-02-01 16:16

Consider this Python code:

import timeit
import re

def one():
        any(s in mystring for s in (\'foo\', \'bar\', \'hello\'))

r = re.compile(\'(foo|bar|hello


      
      
        
          4条回答        

        
                    
            
            
                         
                
              
              
                
                   醉酒成梦
                                             
                
                
                (楼主)
            
              
              
                2021-02-01 16:32
              

            
            
                        
The reason the regex is so slow is because it not only has to go through the whole string, but it has to several calculations at every character.

The first one simply does this:

Does f match h? No.
Does b match h? No.
Does h match h? Yes.
Does e match e? Yes.
Does l match l? Yes.
Does l match l? Yes.
Does o match o? Yes.
Done. Match found.


The second one does this:

Does f match g? No.
Does b match g? No.
Does h match g? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match o? No.
Does b match o? No.
Does h match o? No.
Does f match d? No.
Does b match d? No.
Does h match d? No.
Does f match b? No.
Does b match b? Yes.
Does a match y? No.
Does h match b? No.
Does f match y? No.
Does b match y? No.
Does h match y? No.
Does f match e? No.
Does b match e? No.
Does h match e? No.
... 999 more times ...
Done. No match found.


I can only speculate about the difference between the any and regex, but I'm guessing the regex is slower mostly because it runs in a highly complex engine, and with state machine stuff and everything, it just isn't as efficient as a specific implementation (in).

In the first string, the regex will find a match almost instantaneously, while any has to loop through the string twice before finding anything.

In the second string, however, the any performs essentially the same steps as the regex, but in a different order. This seems to point out that the any solution is faster, probably because it is simpler.

Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.

Disclaimer: I don't know Python. I know algorithms.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它4个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复