Finding multiple substrings in a string without iterating over it multiple times

后端未结

关注

 3  611

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data =[]
line = \'akhgvfalfhda.dhgfa.lidh


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  迷失自我        
                
              
                            
                2021-01-04 21:49
              
            
            
                                                                       
One way I could think of to improve is:

Get all unique lengths of the words in _legal
Build a dictionary of words from line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), this should be better than brute force.
Now look for each thing in the dictionary in O(1).

Code:
line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
ul = {len(i) for i in _legal}
s=set()
for l in ul:
    s = s.union({line[i:i+l] for i in range(len(line)-l)})
print(s.intersection(set(_legal)))

Output:
{'thing1', 'fj la', 'thing2', 't5', '5'}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2021-01-04 21:57
              
            
            
                                                                       
The following should work quite efficiently. It implements the suggestion from @SomeDude given above
lens=set([len(i) for i in _legal])
d={}
for k in lens:
    d[k]=[line[i:i+k] for i in range(len(line)-k)]
s=set(sum(d.values(), []))
result=list(s.intersection(set(_legal)))

for the following data (i changed "line" a bit as it returned an empty list due to uppecase in Thing1 and Thing2)
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**thing1**aoufgyafkugafkjhafkjhflahfklh**thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]

Output is:
print(result)

['thing2', 'thing1']

Explanation:
We saved all possible lengths of the words in substrings. For these lengths, we created all possible substrings in text (set s). Finally we found common items in s and substrings, which is the answer to the question.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  盖世英雄少女心        
                
              
                            
                2021-01-04 22:08
              
            
            
                                                                       
One approach is to build a very simple regex pattern, and use re.findall() to find/extract any matched words in the string.
import re

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']

exp = re.compile(r'|'.join(_legal), re.IGNORECASE)
exp.findall(line)

>>> ['Thing1', 'Thing2']

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复