Profiling shows this is the slowest segment of my code for a little word game I wrote:
def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference
from itertools import izip

def is_neighbors(word1, word2):
    different = False
    for c1, c2 in izip(word1, word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different
Or maybe in-lining the izip code:
def is_neighbors(word1, word2):
    different = False
    next1 = iter(word1).next
    next2 = iter(word2).next
    try:
        while 1:
            if next1() != next2():
                if different:
                    return False
                different = True
    except StopIteration:
        pass
    return different
And a rewritten getchildren:
def iterchildren(word, wordlist):
    return (w for w in wordlist if is_neighbors(word, w))
Your function distance is calculating the total distance, when you really only care about distance=1. The majority of cases you'll know it's >1 within a few characters, so you could return early and save a lot of time.
Beyond that, there might be a better algorithm, but I can't think of it.
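A minimal sketch of that early-return idea (the function name and the `cap` parameter are my own, not from the question):

```python
def distance_capped(word1, word2, cap=1):
    # Stop counting as soon as the difference exceeds `cap`;
    # callers that only care about distance == 1 never pay for
    # scanning the rest of the word.
    difference = 0
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            difference += 1
            if difference > cap:
                break  # already know it's more than we care about
    return difference
```

Anything greater than `cap` in the return value just means "too far apart", which is all the neighbour check needs.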
Edit: Another idea.
You can make 2 cases, depending on whether the first character matches. If it doesn't match, the rest of the word has to match exactly, and you can test for that in one shot. Otherwise, do it similarly to what you were doing. You could even do it recursively, but I don't think that would be faster.
def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    same = True
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if same:
                same = False
            else:
                return False
    return not same
Edit 2: I've deleted the check to see if the strings are the same length, since you say it's redundant. Running Ryan's tests on my own code and on the is_neighbors function provided by MizardX, I get the following:
Edit 3: (Probably getting into community wiki territory here, but...)
Trying your final definition of is_neighbors() with izip instead of zip: 2.9 seconds.
Here's my latest version, which still times at 1.1 seconds:
def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if different:
                return False
            different = True
    return different
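For reference, a self-contained harness along the lines of those timings (written for Python 3; the random word list and repeat counts here are made up, so the absolute numbers won't match the 1.1 seconds quoted):

```python
import random
import string
import timeit

def DifferentByOne(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for i in range(1, len(word1)):
        if word1[i] != word2[i]:
            if different:
                return False
            different = True
    return different

# Made-up benchmark data: 1000 random 5-letter "words".
random.seed(0)
words = ["".join(random.choice(string.ascii_lowercase) for _ in range(5))
         for _ in range(1000)]

# Time 100 passes of comparing one word against the whole list.
elapsed = timeit.timeit(
    lambda: [DifferentByOne("hello", w) for w in words], number=100)
print(round(elapsed, 3))
```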
Well, you can start by having your loop break as soon as the difference reaches 2 or more.
Also you can change
for i in range(len(word1)):
to
for i in xrange(len(word1)):
Because xrange generates sequences on demand instead of generating the whole range of numbers at once.
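(In Python 3 this distinction is gone, since range is itself lazy. A quick way to see the memory difference the advice is about:)

```python
import sys

# In Python 2, range(10**6) builds a million-element list up front,
# while xrange(10**6) yields numbers on demand. In Python 3, range
# is lazy by default, so the range object stays tiny regardless of
# the bound; only materializing it as a list pays the full cost.
r = range(10**6)
print(sys.getsizeof(r))                           # small constant size
print(sys.getsizeof(list(r)) > sys.getsizeof(r))  # the list is far bigger
```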
You can also try comparing word lengths first, which would be quicker. Also note that your code doesn't work if word1 is longer than word2.
There's not much else you can do algorithmically after that, which is to say you'll probably find more of a speedup by porting that section to C.
Edit 2
Attempting to explain my analysis of Sumudu's algorithm compared to verifying differences char by char.
When you have a word of length L, the number of "differs-by-one" words you will generate will be 25L. We know from implementations of sets on modern computers, that the search speed is approximately log(n) base 2, where n is the number of elements to search for.
Seeing that most of the 5 million words you test against are not in the set, most of the time you will be traversing the entire set, which means that it really becomes log(25L) instead of only log(25L)/2. (And this is assuming the best-case scenario for sets, that comparing string by string is equivalent to comparing char by char.)
Now we take a look at the time complexity for determining a "differs-by-one". If we assume that you have to check the entire word, then the number of operations per word becomes L. We know that most words differ by 2 very quickly. And knowing that most prefixes take up a small portion of the word, we can logically assume that you will break most of the time by L/2, or half the word (and this is a conservative estimate).
So now we plot the time complexities of the two searches, L/2 and log(25L), and keeping in mind that this is even considering string matching the same speed as char matching (highly in favor of sets). You have the equation log(25*L) > L/2, which can be simplified down to log(25) > L/2 - log(L). As you can see from the graph, it should be quicker to use the char matching algorithm until you reach very large numbers of L.
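The comparison above can be tabulated directly; this just evaluates the two cost estimates from the argument (L/2 character comparisons for the scan vs. roughly log2(25*L) for a set lookup), not an actual measurement:

```python
import math

# Cost estimates from the argument above, for several word lengths L:
# scanning roughly half the word vs. one set membership check.
for L in (5, 10, 20, 50):
    scan = L / 2.0
    lookup = math.log(25 * L, 2)
    print(L, scan, round(lookup, 2))
```

For short words the scan estimate is the smaller of the two, which is the point being made.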
Also, I don't know if you're counting breaking on a difference of 2 or more in your optimization, but from Mark's answer I already break on a difference of 2 or more, and if the difference is in the first letter, it breaks after the first letter. Even in spite of all those optimizations, changing to using sets just blew them out of the water. I'm interested in trying your idea, though.
I was the first person in this question to suggest breaking on a difference of 2 or more. The thing is, that Mark's idea of string slicing (if word1[0] != word2[0]: return word1[1:] == word2[1:]) is simply putting what we are doing into C. How do you think word1[1:] == word2[1:] is calculated? The same way that we are doing.
I read your explanation a few times but I didn't quite follow it; would you mind explaining it a little more in depth? Also, I'm not terribly familiar with C, and I've been working in high-level languages for the past few years (the closest has been learning C++ in high school 6 years ago).
As for producing the C code, I am a bit busy. I am sure you will be able to do it since you have written in C before. You could also try C#, which probably has similar performance characteristics.
More Explanation
Here is a more in-depth explanation for Davy8:
def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)
Your one_letter_off_strings function will create a set of 25L strings (where L is the number of letters).
Creating a set from the wordlist will create a set of D strings (where D is the length of your dictionary). By creating an intersection from this, you MUST iterate over each oneoff and see if it exists in wordlist.
The time complexity for this operation is detailed above. This operation is less efficient than comparing the word you want with each word in wordlist. Sumudu's method is an optimization in C rather than in algorithm.
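For context, a plausible reconstruction of the one_letter_off_strings helper being discussed; its real implementation isn't shown in this thread, so treat this as an assumption:

```python
from string import ascii_lowercase

def one_letter_off_strings(word):
    # For each of the L positions, substitute each of the 25 other
    # letters, giving the 25*L candidate strings mentioned above.
    results = []
    for i, original in enumerate(word):
        for c in ascii_lowercase:
            if c != original:
                results.append(word[:i] + c + word[i + 1:])
    return results
```

For a 5-letter word this yields 125 candidates, matching the counts used in the discussion below.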
More Explanation 2
There's only 4500 total words (because the wordlist is pre-filtered for 5-letter words before even being passed to the algorithm), being intersected with 125 one-letter-off words. It seemed that you were saying intersection is log(smaller), or in other words log(125, 2). Compare this to again assuming what you said, where comparing a word breaks in L/2 letters; I'll round this down to 2, even though for a 5-letter word it's more likely to be 3. This comparison is done 4500 times, so 9000. log(125, 2) is about 6.9, and log(4500, 2) is about 12. Lemme know if I misinterpreted your numbers.
To create the intersection of 125 one-letter-off words with a dictionary of 4500, you need to make 125 * 4500 comparisons. This is not log(125,2). It is at best 125 * log(4500, 2) assuming that the dictionary is presorted. There is no magic shortcut to sets. You are also doing a string by string instead of char by char comparison here.
People are mainly going about this by trying to write a quicker function, but there might be another way...
"distance" is called over 5 million times
Why is this? Perhaps a better way to optimise is to try and reduce the number of calls to distance, rather than shaving milliseconds off distance's execution time. It's impossible to tell without seeing the full script, but optimising a specific function is generally unnecessary.
If that is impossible, perhaps you could write it as a C module?
For this snippet:

for x, y in zip(word1, word2):
    if x != y:
        difference += 1
return difference
I'd use this one (note the comparison must be !=, since we're counting differences, not matches):

return sum(1 for i in xrange(len(word1)) if word1[i] != word2[i])
The same pattern would follow all around the provided code...
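A quick sanity check that the generator expression counts the same thing as the original loop (test pairs made up; range here stands in for Python 2's xrange):

```python
def distance_loop(word1, word2):
    # the original explicit-loop version
    difference = 0
    for x, y in zip(word1, word2):
        if x != y:
            difference += 1
    return difference

def distance_gen(word1, word2):
    # generator-expression equivalent of the same count
    return sum(1 for i in range(len(word1)) if word1[i] != word2[i])

for a, b in [("hello", "hello"), ("hello", "jello"), ("abcde", "vwxyz")]:
    print(distance_loop(a, b), distance_gen(a, b))
```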
Try this:
def distance(word1, word2):
    return sum([not c1 == c2 for c1, c2 in zip(word1, word2)])
Also, do you have a link to your game? I like being destroyed by word games