Fastest way to check if a string contains specific characters in any of the items in a list

前端未结

关注

 4  1081

What is the fastest way to check if a string contains some characters from any items of a list?

Currently, I\'m using this method:

lestring = \"Text1


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  离开以前        
                
              
                            
                2021-01-02 05:01
              
            
            
                                                                       
if the test is to see if there are any characters in common (not words or segments), create a set out of the letters in the list and then check the letters agains the string:

char_list = set(''.join(list_of_words))
test_set = set(string_to_teat)
common_chars = char_list.intersection(test_set)


However I'm assuming you're looking for as little as one character in common...
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  轻奢々        
                
              
                            
                2021-01-02 05:02
              
            
            
                                                                       
The esmre library does the trick. In your case, the simpler, esm (part of esmre) is what you want.

https://pypi.python.org/pypi/esmre/

https://code.google.com/p/esmre/

They have good documentation and examples:
Taken from their examples:

>>> import esm
>>> index = esm.Index()
>>> index.enter("he")
>>> index.enter("she")
>>> index.enter("his")
>>> index.enter("hers")
>>> index.fix()
>>> index.query("this here is history")
[((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
>>> index.query("Those are his sheep!")
[((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
>>> 


I ran some performance tests:

import random, timeit, string, esm

def uz(lelist, lestring):
    for x in lelist:
        if lestring.count(x):
            return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)



def ab(lelist, lestring):
    return [e for e in lelist if e in lestring]


def use_esm(index, lestring):
    return index.query(lestring)

for TEXT_LEN in [5, 50, 1000]:
    for SEARCH_LEN in [5, 20]:
        for N in [5, 50, 1000, 10000]:
            if TEXT_LEN < SEARCH_LEN:
                continue

            print 'TEXT_LEN:', TEXT_LEN, 'SEARCH_LEN:', SEARCH_LEN, 'N:', N

            lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN)))
            lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _
                      in range(N)]

            index = esm.Index()
            for i in lelist:
                index.enter(i)
            index.fix()

            t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
            t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
            t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm")

            ab_time = t_ab.timeit(1000)
            uz_time = t_uz.timeit(1000)
            esm_time = t_esm.timeit(1000)

            min_time = min(ab_time, uz_time, esm_time)
            print '  ab%s: %f' % ('*' if ab_time == min_time else '', ab_time)
            print '  uz%s: %f' % ('*' if uz_time == min_time else '', uz_time)
            print '  esm%s %f:' % ('*' if esm_time == min_time else '', esm_time)


And got that results depends mostly on the number of items that one is looking for (in my case, 'N'):

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5
  ab*: 0.001733
  uz: 0.002512
  esm 0.126853:

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50
  ab*: 0.017564
  uz: 0.023701
  esm 0.079925:

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000
  ab: 0.370371
  uz: 0.489523
  esm* 0.133783:

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000
  ab: 3.678790
  uz: 4.883575
  esm* 0.259605:

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  滥情空心        
                
              
                            
                2021-01-02 05:15
              
            
            
                                                                       
You can try list comprehension with membership check

>>> lestring = "Text123"
>>> lelist = ["Text", "foo", "bar"]
>>> [e for e in lelist if e in lestring]
['Text']


Compared to your implementation, though LC has an implicit loop but its faster as there is no explicit function call as in your case with count

Compared to Joe's implementation, yours is way faster, as the filter function would require to call two functions in a loop, lambda and count

>>> def joe(lelist, lestring):
    return ''.join(random.sample(x + 'b'*len(x), len(x)))

>>> def uz(lelist, lestring):
    for x in lelist:
        if lestring.count(x):
            return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)


>>> def ab(lelist, lestring):
    return [e for e in lelist if e in lestring]

>>> t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
>>> t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
>>> t_joe = timeit.Timer("joe(lelist, lestring)", setup="from __main__ import lelist, lestring, joe")
>>> t_ab.timeit(100000)
0.09391469893125759
>>> t_uz.timeit(100000)
0.1528471407273173
>>> t_joe.timeit(100000)
1.4272649857800843




Jamie's commented solution is slower for shorter string's. Here is the test result

>>> def jamie(lelist, lestring):
    return next(itertools.chain((e for e in lelist if e in lestring), (None,))) is not None

>>> t_jamie = timeit.Timer("jamie(lelist, lestring)", setup="from __main__ import lelist, lestring, jamie")
>>> t_jamie.timeit(100000)
0.22237164127909637




If you need Boolean values, for shorter strings, just modify the above LC expression

[e in lestring for e in lelist if e in lestring]


Or for longer strings, you can do the following

>>> next(e in lestring for e in lelist if e in lestring)
True


or 

>>> any(e in lestring for e in lelist)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  名媛妹妹        
                
              
                            
                2021-01-02 05:16
              
            
            
                                                                       
filter(lambda x: lestring.count(x), lelist)


That will return all the strings that you're trying to find as a list.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复