How can I optimize this Python code to generate all words with word-distance 1?

予麋鹿 2021-01-30 22:11

Profiling shows this is the slowest segment of my code for a little word game I wrote:

def distance(word1, word2):
    difference = 0
    for i in range(len(word         


        
12 Answers
  • 2021-01-30 22:26

    Everyone else focused on the explicit distance calculation without doing anything about how the distance-1 candidates are constructed. You can improve on this with a well-known data structure called a trie, which merges the implicit distance calculation with the task of generating all distance-1 neighbor words. A trie is a tree in which each node stands for a letter, and the node's 'next' field is a dict with up to 26 entries, each pointing to a child node.

    Here's the pseudocode: walk the trie iteratively for your given word; at each node, follow both the matching child (distance unchanged) and the non-matching children (distance plus one); keep a counter of the distance still allowed and decrement it on every mismatch. You don't need recursion, just a lookup function that takes an extra distance_so_far integer argument.
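    The walk described above could look roughly like the following. This is a minimal Python 3 sketch, not the answerer's own code: the names `build_trie`, `distance_one_words` and the `END` sentinel are made up for illustration, the trie is stored as nested dicts rather than node objects, and only same-length substitutions are considered.

```python
END = ""  # hypothetical sentinel key; it cannot clash with a one-letter key


def build_trie(words):
    """Build a nested-dict trie; each node maps a letter to a child node."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word  # store the full word at its final node
    return root


def distance_one_words(trie, word):
    """Return all stored words that differ from `word` in exactly one position."""
    results = []
    # Each stack entry: (node, position in word, substitutions still allowed)
    stack = [(trie, 0, 1)]
    while stack:
        node, pos, budget = stack.pop()
        if pos == len(word):
            # A budget of 0 here means exactly one letter was substituted
            if budget == 0 and END in node:
                results.append(node[END])
            continue
        for letter, child in node.items():
            if letter == END:
                continue
            if letter == word[pos]:
                stack.append((child, pos + 1, budget))
            elif budget > 0:
                stack.append((child, pos + 1, budget - 1))
    return results
```

    An explicit stack replaces recursion here. Branches that have already used their one substitution can only continue along exact matches, so most of the trie is never visited.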

    You can trade an O(N) increase in space for a little extra speed by building separate tries for length-3, length-4, length-5 (and so on) words.

  • 2021-01-30 22:28

    If your wordlist is very long, might it be more efficient to generate all possible 1-letter-differences from 'word', then check which ones are in the list? I don't know any Python but there should be a suitable data structure for the wordlist allowing for log-time lookups.

    I suggest this because if your words are a reasonable length (~10 letters), then you'll only be looking up about 250 candidate words (10 positions × 25 alternative letters), which is probably faster than scanning the list if your wordlist is larger than a few hundred words.
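    In Python, this idea might be sketched as follows. It assumes the words consist of lowercase ASCII letters only and that the wordlist has been converted to a `set`, which gives constant-time lookups on average (better than the log-time lookups asked for above). The function name `children_by_generation` is made up for this example.

```python
from string import ascii_lowercase


def children_by_generation(word, wordset):
    """Generate every one-letter variation of `word` and keep those in `wordset`."""
    found = []
    for i, original in enumerate(word):
        for letter in ascii_lowercase:
            if letter == original:
                continue  # skip the unchanged word itself
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in wordset:  # average O(1) membership test
                found.append(candidate)
    return found
```

    The cost depends on the word length rather than on the size of the wordlist, which is the point of the suggestion.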

  • 2021-01-30 22:28

    First thing to occur to me:

    from operator import ne
    
    def distance(word1, word2):
        return sum(map(ne, word1, word2))
    

    which has a decent chance of going faster than other functions people have posted, because it has no interpreted loops, just calls to Python primitives. And it's short enough that you could reasonably inline it into the caller.

    For your higher-level problem, I'd look into the data structures developed for similarity search in metric spaces, e.g. this paper or this book, neither of which I've read (they came up in a search for a paper I have read but can't remember).

  • 2021-01-30 22:33

    How often is the distance function called with the same arguments? A simple-to-implement optimization would be to use memoization.
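    A minimal memoization sketch, assuming Python 3, where `functools.lru_cache` is available (on Python 2 you would need a hand-written dict cache instead). The body of `distance` is a compact stand-in for the asker's function.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache keyed on the (word1, word2) pair
def distance(word1, word2):
    """Hamming distance between two equal-length words."""
    return sum(a != b for a, b in zip(word1, word2))
```

    This only pays off if the same pair of words really is compared repeatedly. For short words the cache lookup may cost about as much as the comparison itself, so it is worth timing.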

    You could probably also create some sort of dictionary keyed on frozensets of letters, whose values are lists of words that differ by one, and look up values in that. This data structure could either be stored and loaded through pickle or generated from scratch at startup.
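    One way to precompute such a lookup table is sketched below. Note that it swaps the frozenset keys suggested above for wildcard patterns (each word with one position masked out), which is a different but related technique. It assumes the character `_` never appears in a word; `build_index` and `lookup` are hypothetical names.

```python
import pickle
from collections import defaultdict


def build_index(words):
    """Map each one-position-masked pattern (e.g. 'h_llo') to the words matching it."""
    index = defaultdict(list)
    for word in words:
        for i in range(len(word)):
            index[word[:i] + "_" + word[i + 1:]].append(word)
    return dict(index)


def lookup(index, word):
    """Return the set of indexed words that differ from `word` in exactly one position."""
    found = set()
    for i in range(len(word)):
        pattern = word[:i] + "_" + word[i + 1:]
        for other in index.get(pattern, ()):
            if other != word:
                found.add(other)
    return found


index = build_index(["hello", "hallo", "jello", "help"])
# The index is a plain dict, so it survives a pickle round trip unchanged
restored = pickle.loads(pickle.dumps(index))
```

    Building the index costs one pass over the wordlist. After that, each query touches only as many buckets as the word has letters.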

    Short-circuiting the evaluation will only give you gains if the words you are using are very long, since the Hamming distance algorithm you're using is basically O(n), where n is the word length.

    I did some experiments with timeit for some alternative approaches that may be illustrative.

    Timeit Results

    Your Solution

    import timeit
    
    d = """\
    def distance(word1, word2):
        difference = 0
        for i in range(len(word1)):
            if word1[i] != word2[i]:
                difference += 1
        return difference
    """
    t1 = timeit.Timer('distance("hello", "belko")', d)
    print t1.timeit() # prints 6.502113536776391
    

    One Liner

    d = """\
    from itertools import izip
    def hamdist(s1, s2):
        return sum(ch1 != ch2 for ch1, ch2 in izip(s1,s2))
    """
    t2 = timeit.Timer('hamdist("hello", "belko")', d)
    print t2.timeit() # prints 10.985101179
    

    Shortcut Evaluation

    d = """\
    def distance_is_one(word1, word2):
        diff = 0
        for i in xrange(len(word1)):
            if word1[i] != word2[i]:
                diff += 1
            if diff > 1:
                return False
        return diff == 1
    """
    t3 = timeit.Timer('distance_is_one("hello", "belko")', d)
    print t3.timeit() # prints 6.63337
    
  • 2021-01-30 22:33

    I don't know if it will significantly affect your speed, but you could start by turning the list comprehension into a generator expression. It's still iterable so it shouldn't be much different in usage:

    def getchildren(word, wordlist):
        return [ w for w in wordlist if distance(word, w) == 1 ]
    

    to

    def getchildren(word, wordlist):
        return ( w for w in wordlist if distance(word, w) == 1 )
    

    The main difference is that a list comprehension builds the whole list in memory, which can take up quite a bit of space, whereas a generator yields each matching word on the fly, so there is no need to store the whole result.
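    A small illustration of the difference in usage, written as a Python 3 sketch with a made-up four-word list and a compact stand-in for `distance`. The caveat it shows is that a generator can be consumed only once.

```python
def distance(word1, word2):
    """Hamming distance between two equal-length words."""
    return sum(a != b for a, b in zip(word1, word2))


def getchildren(word, wordlist):
    # Generator expression: nothing is computed until the caller iterates
    return (w for w in wordlist if distance(word, w) == 1)


wordlist = ["hallo", "jello", "world", "hells"]
children = getchildren("hello", wordlist)
first = next(children, None)  # stops as soon as the first match is found
rest = list(children)         # consumes the remaining matches
again = list(children)        # empty: the generator is already exhausted
```

    If the caller needs the result more than once, or needs `len()`, wrapping the call in `list(...)` restores the original behavior.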

    Also, following on unknown's answer, this may be a more "pythonic" way of writing distance():

    def distance(word1, word2):
        difference = 0
        # Walk both words in lockstep and count the positions that differ
        for x, y in zip(word1, word2):
            if x != y:
                difference += 1
        return difference
    

    But it's unclear what is intended when len(word1) != len(word2): zip stops at the end of the shorter word, so any extra characters in the longer word are silently ignored. (Which could turn out to be an optimization...)

  • 2021-01-30 22:35

    For such a simple function that has such a large performance implication, I would probably make a C library and call it using ctypes. One of reddit's founders claims they made the website 2x as fast using this technique.

    You can also use psyco on this function, but beware that it can eat up a lot of memory.
