How can I optimize this Python code to generate all words with word-distance 1?


Profiling shows this is the slowest segment of my code for a little word game I wrote:

def distance(word1, word2):
    difference = 0
    for i in range(len(word


                      
12 Answers

时光取名叫无心
2021-01-30 22:26

Everyone else focused just on explicit distance-calculation without doing anything about constructing the distance-1 candidates.
You can improve on that by using a well-known data structure called a Trie to merge the implicit distance calculation with the task of generating all distance-1 neighbor words. A Trie is a tree where each node stands for a letter, and the 'next' field is a dict with up to 26 entries, each pointing to a child node.

Here's the pseudocode: walk the Trie iteratively for your given word; at each node, follow the matching child (distance 0) and, while you still have budget left, the non-matching children (distance 1); keep a counter of the remaining distance and decrement it on each mismatch. You don't need recursion, just a lookup function that takes an extra distance_so_far integer argument.

You can trade O(N) extra space for a little more speed by building separate Tries for length-3, length-4, length-5, etc. words.
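
A rough Python 3 sketch of that walk, in case it helps. The names build_trie, neighbors and END are made up for illustration, the Trie is just nested dicts, and I'm assuming "distance 1" means same length with exactly one letter changed. The stack carries the number of mismatches used so far instead of a decrementing counter, which amounts to the same thing.

```python
END = ""  # hypothetical end-of-word key; cannot collide with a one-letter key


def build_trie(words):
    """Build a Trie out of nested dicts; node[END] holds the finished word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def neighbors(trie, word):
    """Return the words in the Trie that differ from `word` in exactly one letter."""
    results = []
    # Each stack entry is (node, position in word, mismatches used so far).
    stack = [(trie, 0, 0)]
    while stack:
        node, i, used = stack.pop()
        if i == len(word):
            # Only words of the same length with exactly one mismatch qualify.
            if used == 1 and END in node:
                results.append(node[END])
            continue
        for ch, child in node.items():
            if ch == END:
                continue
            cost = used + (ch != word[i])
            if cost <= 1:  # prune branches that already differ in two letters
                stack.append((child, i + 1, cost))
    return results
```

For example, with the Trie built from cat, cot, cut, bat, dog and cats, calling neighbors(trie, "cat") finds bat, cot and cut (in no particular order) and never descends into the dog branch past its second letter.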
                                                                        
                                                        
            
            
              
                
醉话见心
2021-01-30 22:28

If your wordlist is very long, might it be more efficient to generate all possible 1-letter-differences from 'word', then check which ones are in the list?  I don't know any Python but there should be a suitable data structure for the wordlist allowing for log-time lookups.

I suggest this because if your words are reasonable lengths (~10 letters), then you'll only be looking for 250 potential words, which is probably faster if your wordlist is larger than a few hundred words.
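
For what it's worth, here is a minimal Python 3 sketch of that idea. It assumes lowercase ASCII words, and neighbors_by_generation is just a name I picked. A Python set gives average constant-time membership tests, which is even better than the log-time lookups mentioned above.

```python
import string


def neighbors_by_generation(word, wordset):
    """Try every single-letter substitution of `word` and keep the real words."""
    found = []
    for i, original in enumerate(word):
        for letter in string.ascii_lowercase:
            if letter == original:
                continue  # substituting the same letter would give distance 0
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in wordset:  # set lookup, average O(1)
                found.append(candidate)
    return found
```

Build the set once with set(wordlist) and reuse it; a 10-letter word then costs 250 lookups no matter how long the word list is.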
                                                                        
                                                        
            
            
              
                
野趣味
2021-01-30 22:28

First thing to occur to me:

from operator import ne

def distance(word1, word2):
    return sum(map(ne, word1, word2))


which has a decent chance of going faster than other functions people have posted, because it has no interpreted loops, just calls to Python primitives. And it's short enough that you could reasonably inline it into the caller.

For your higher-level problem, I'd look into the data structures developed for similarity search in metric spaces, e.g. this paper or this book, neither of which I've read (they came up in a search for a paper I have read but can't remember).
                                                                        
                                                        
            
            
              
                
时光取名叫无心
2021-01-30 22:33

How often is the distance function called with the same arguments? A simple-to-implement optimization would be memoization.
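
A minimal Python 3 sketch of memoizing the function, assuming the standard library's functools.lru_cache is acceptable here. It only pays off if the same pair of words really does come up repeatedly.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache keyed on the (word1, word2) pair
def distance(word1, word2):
    """Count the positions at which the two words differ."""
    return sum(a != b for a, b in zip(word1, word2))
```

Note that ("hello", "belko") and ("belko", "hello") are cached as separate entries, so it may be worth sorting the pair before the call if both orders occur.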

You could probably also create some sort of dictionary with frozensets of letters and lists of words that differ by one and look up values in that. This data structure could either be stored and loaded through pickle or generated from scratch at startup.
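
One possible shape for such a precomputed dictionary, sketched in Python 3. Be aware that this swaps the frozenset keys for wildcard patterns (the word with one position blanked out), which is my own substitution, and that build_index and lookup are hypothetical names. It also assumes "_" never appears in a word.

```python
from collections import defaultdict


def build_index(words):
    """Map each pattern (word with one position blanked) to the words matching it."""
    index = defaultdict(list)
    for word in words:
        for i in range(len(word)):
            index[word[:i] + "_" + word[i + 1:]].append(word)
    return index


def lookup(word, index):
    """Return every indexed word that differs from `word` in exactly one position."""
    found = []
    for i in range(len(word)):
        # .get avoids adding empty entries to the defaultdict on a miss
        for other in index.get(word[:i] + "_" + word[i + 1:], ()):
            if other != word:
                found.append(other)
    return found
```

The index is built once per word list, so it is a natural candidate for pickling and reloading at startup.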

Short-circuiting the evaluation will only give you gains if the words you are using are very long, since the Hamming distance algorithm you're using is basically O(n), where n is the word length.

I did some experiments with timeit for some alternative approaches that may be illustrative.

Timeit Results

Your Solution

import timeit

d = """\
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
"""
t1 = timeit.Timer('distance("hello", "belko")', d)
print(t1.timeit())  # prints 6.502113536776391


One Liner

d = """\
# izip is Python 2 only; on Python 3 use the built-in zip instead
from itertools import izip
def hamdist(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in izip(s1,s2))
"""
t2 = timeit.Timer('hamdist("hello", "belko")', d)
print(t2.timeit())  # prints 10.985101179


Shortcut Evaluation

d = """\
def distance_is_one(word1, word2):
    diff = 0
    for i in xrange(len(word1)):
        if word1[i] != word2[i]:
            diff += 1
        if diff > 1:
            return False
    return diff == 1
"""
# time the function defined in this setup string, and print this timer's result
t3 = timeit.Timer('distance_is_one("hello", "belko")', d)
print(t3.timeit())  # prints 6.63337

                                                                        
                                                        
            
            
              
                
感动是毒
2021-01-30 22:33

I don't know if it will significantly affect your speed, but you could start by turning the list comprehension into a generator expression.  It's still iterable so it shouldn't be much different in usage:

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]


to

def getchildren(word, wordlist):
    return ( w for w in wordlist if distance(word, w) == 1 )


The main difference is that a list comprehension builds the whole list in memory, which can take up quite a bit of space, whereas the generator produces each match on the fly, so there is no need to store the whole thing.

Also, following on unknown's answer, this may be a more "pythonic" way of writing distance():

def distance(word1, word2):
    difference = 0
    for x, y in zip(word1, word2):
        if x != y:  # count the positions where the letters differ
            difference += 1
    return difference


But it's unclear what's intended when len(word1) != len(word2). With zip, only as many characters as the shorter word has get compared. (Which could turn out to be an optimization...)
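
To make the zip behaviour concrete, here is a small Python 3 sketch. Raising an error on unequal lengths is only one possible policy and is my assumption, not something the question specifies; strict_distance is a made-up name.

```python
def strict_distance(word1, word2):
    """Hamming distance that refuses words of unequal length."""
    if len(word1) != len(word2):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(word1, word2))


# zip stops at the end of the shorter word, so the trailing "s" is never compared.
pairs = list(zip("cat", "cats"))
```

Without the length check, "cat" and "cats" would come out at distance 0, which is probably not what a word game wants.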
                                                                        
                                                        
            
            
              
                
无人共我
2021-01-30 22:35

For such a simple function with such a large performance impact, I would probably write a C library and call it using ctypes. One of Reddit's founders claims they made the website 2x as fast using this technique.

You can also use psyco on this function, but beware that it can eat up a lot of memory.
                                                                        
                                                        
            
            
              
                