Search a list of strings for any sub-string from another list

挽巷 2020-12-29 06:05

Given these 3 lists of data and a list of keywords:

good_data1 = ['hello, world', 'hey, world']
good_data2 = ['hey, man', 'whats up']
bad_data = ['h


        
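The question text is truncated above. Judging from the answers below, the "original" approach they benchmark against was presumably a plain nested scan, roughly like the following hypothetical reconstruction (the keywords list shown is an assumption, not from the question):

keywords = ['hello', 'hey']  # assumed example keywords

def check_data(data):
    # Check every item against every keyword, one pair at a time.
    for s in data:
        for k in keywords:
            if k in s:
                return True
    return False
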
5 Answers
  • 2020-12-29 06:58

    In your example, with so few items, it doesn't really matter. But if you have a list of several thousand items, this might help.

    Since you don't care which element in the list contains the keyword, you can scan the whole list once (as one string) instead of one item at a time. For that you need a join character that you know won't occur in the keywords, in order to avoid false positives. I use the newline in this example.

    def check_data(data):
        # Join the whole list into one long string so each keyword
        # is searched for once, instead of once per item.
        s = "\n".join(data)
        for k in keywords:
            if k in s:
                return True
        return False
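
    For instance, with a hypothetical keywords list:

    keywords = ['hello', 'hey']  # hypothetical keywords for illustration

    print(check_data(['hello, world', 'hey, world']))  # True
    print(check_data(['hi, earth']))                   # False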
    

    In my completely unscientific test, my version checked a list of 5000 items 100000 times in about 30 seconds. I stopped your version after 3 minutes -- got tired of waiting to post =)

  • 2020-12-29 06:59

    If you have many keywords, you might want to try a suffix tree [1]. Insert all the words from the three data lists, storing which list each word comes from in its terminating node. Then you can perform queries on the tree for each keyword really, really fast.

    Warning: suffix trees are very complicated to implement!

    [1] http://en.wikipedia.org/wiki/Suffix_tree
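
    A far simpler stand-in that captures the same query idea is a naive "suffix array": a sorted list of all suffixes of the joined data, searched by bisection. This sketch is illustration only (building it is quadratic, unlike a proper suffix tree), and the function names are made up:

    import bisect

    def build_suffixes(data, sep="\n"):
        # Naive suffix array: every suffix of the joined text, sorted.
        text = sep.join(data)
        return sorted(text[i:] for i in range(len(text)))

    def occurs(suffixes, keyword):
        # A keyword occurs in the text iff it is a prefix of some
        # suffix; binary search locates the candidate in O(log n).
        i = bisect.bisect_left(suffixes, keyword)
        return i < len(suffixes) and suffixes[i].startswith(keyword)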

  • 2020-12-29 06:59

    You may be able to improve matters by building your list of keywords as a regular expression.

    This may allow them to be tested in parallel, but will very much depend on what the keywords are (e.g. some work may be reused testing for "hello" and "hell", rather than searching every phrase from the start for each word).

    You could do this by executing:

    import re
    keyword_re = re.compile("|".join(map(re.escape, keywords)))
    

    Then:

    >>> bool(keyword_re.search('hello, world'))
    True
    >>> bool(keyword_re.search('hi, earth'))
    False
    

    (It will actually return a match object when a match is found, and None if not found - this might be useful if you need to know which keyword matched.)
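
    For example, to see which keyword matched (assuming 'hello' is among the keywords):

    m = keyword_re.search('hello, world')
    if m:
        print(m.group(0))  # the keyword that matched, e.g. 'hello'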

    However, how much (if anything) this gains you will depend on the keywords. If you only have one or two, keep your current approach. If you have a large list, it may be worth trying and profiling to see which performs better.

    [Edit] For reference, here's how the approaches do for your example:

                   good1   good2  good3   bad1   bad2
    original     : 0.206   0.233  0.229   0.390   63.879
    gnud (join)  : 0.257   0.347  4.600   0.281    6.706
    regex        : 0.766   1.018  0.397   0.764  124.351
    regex (join) : 0.345   0.337  3.305   0.481   48.666
    

    Obviously for this case, your approach performs far better than the regex one. Whether this will always be the case depends a lot on the number and complexity of keywords, and the input data that will be checked. For large numbers of keywords, and lengthy lists or rarely matching phrases, regexes may work better, but do get timing information, and perhaps try even simpler optimisations (like moving the most common words to the front of your keyword list) first. Sometimes the simplest approach really is the best.

    [Edit2] Updated the table with gnud's solution, and a similar approach before applying the regexes. I also added 2 new tests:

    good_data3 = good_data2 * 500  # 1000 items, the first of which matches.
    bad_data2 = bad_data * 500     # 1000 items, none of which matches.
    

    These show up the various strengths and weaknesses. Joining does do worse when a match would immediately be found (as there is an always-paid, up-front cost in joining the list - this is the best possible case for the linear search method); however, for non-matching lists it performs better, and much better when there are a large number of items in the list.
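
    For reference, the "regex (join)" row in the table above presumably combines both ideas, roughly:

    def check_data_regex_join(data):
        # Join the list once, then make a single pass with the
        # compiled alternation regex.
        return keyword_re.search("\n".join(data)) is not None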

  • 2020-12-29 07:06

    I think this is pretty efficient and clear, though you could use map() to avoid the many nested loops. I agree with ross on the dictionary idea for larger lists.

  • 2020-12-29 07:09

    Are you looking for

    any(k in s for k in keywords)
    

    It's more compact, but might be less efficient.
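
    Applied to a whole data list, this extends naturally to a single generator expression over both loops:

    def check_data(data):
        # True as soon as any keyword occurs in any item.
        return any(k in s for s in data for k in keywords)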
