How can this Python Scrabble word finder be made faster?

梦如初夏 2021-01-31 10:08

I have no real need to improve it; it's just for fun. Right now it takes about a second on a list of about 200K words.

I've tried to optimize it as much as I know how.

5 answers
  • 2021-01-31 10:33

    Without going too far from your basic code, here are some fairly simple optimizations:

    First, change your word reader to be:

    def word_reader(filename, L):
      L2 = L+2
      # returns an iterator
      return (word.strip() for word in open(filename) \
              if len(word) < L2 and len(word) > 2)
    

    and call it as

    words = word_reader('/usr/share/dict/words', len(rack))
    

    This gives the biggest improvement of all my suggested changes, because it eliminates words that are too long or too short before any further work is done on them. Note that word still carries its trailing newline when the lengths are compared, which is why the bounds are L+2 and 2 rather than L+1 and 1; I assumed '\n' line separators. There could also be a problem with the last word in the list, because it probably won't have a newline at the end of it, but on my computer the last word is études, which won't be found with our method anyway. Of course, you could just create your own dictionary beforehand from the original, removing the words that aren't valid: those that aren't the right length or have letters outside of a-z.
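
    As a rough sketch of that last idea (building a cleaned word list up front), something like the following could work. It is written for Python 3, unlike the Python 2 snippets in this answer, and the names clean_words, min_len and max_len are made up for illustration; the default bounds are assumptions, not values taken from the original code.

```python
import string

# Only plain lowercase a-z is accepted; anything else disqualifies the word.
ALLOWED = frozenset(string.ascii_lowercase)

def clean_words(lines, min_len=2, max_len=7):
    """Yield stripped words that use only a-z and fit the length bounds.

    min_len and max_len are illustrative defaults, not values from the thread.
    """
    for line in lines:
        word = line.strip()  # strip first, so the bounds apply to the real length
        if min_len <= len(word) <= max_len and ALLOWED.issuperset(word):
            yield word

# Small in-memory stand-in for lines read from a dictionary file.
sample = ["a\n", "at\n", "Zebra\n", "skate\n", "études", "takes\n"]
cleaned = list(clean_words(sample))
print(cleaned)  # ['at', 'skate', 'takes']
```

    Because the word is stripped before its length is checked, a missing newline on the last line no longer matters.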

    Next, Ferran suggested a variable for the rack set, which is a good idea. However, you are also getting a pretty major slowdown from making a set out of every word. The purpose of using the sets at all was to weed out the many words that don't have any shot at all and thereby give a speed-up. However, I found it was even faster to just check whether the first letter of the word is in the rack before calling spellable:

    rackset = frozenset(rack)
    scored =  [(score_word(word), word) for word in words if word[0] in rackset \
               and spellable(word, rack)]
    

    However, this has to be accompanied by a change to spellable. I changed it to the following:

    def spellable(word, rack):
        return all( [rack.count(letter) >= word.count(letter) \
                     for letter in set(word)] )
    

    which, even without the changes in the previous step, is faster than what you currently have.

    With the three changes above, the code was about 3x faster from my simple tests.
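
    To see why the first-letter check needs the count-based spellable behind it, here is a small Python 3 sketch of the two working together. The helper name candidates and the sample word list are mine, purely for illustration.

```python
def spellable(word, rack):
    # Each distinct letter must occur in the rack at least as often as in the word.
    # A letter missing from the rack gives rack.count(letter) == 0, so the word fails.
    return all(rack.count(letter) >= word.count(letter) for letter in set(word))

def candidates(words, rack):
    rackset = frozenset(rack)  # built once, outside the loop
    # The cheap first-letter test runs first; spellable only sees the survivors.
    return [w for w in words if w and w[0] in rackset and spellable(w, rack)]

words = ["skate", "takes", "steak", "skates", "zebra", "at"]
found = candidates(words, "aekstzz")
print(found)  # ['skate', 'takes', 'steak', 'at']
```

    Here "zebra" passes the first-letter test (the rack has a z) but is rejected by spellable, and "skates" is rejected because the rack holds only one s.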

    On to a better algorithm

    Since what you are really doing is looking for anagrams, it makes sense to use an anagram dictionary. An anagram dictionary takes each word in a dictionary and groups together the words that are anagrams of one another. For instance, 'takes' and 'skate' are anagrams of each other because they are both equal to 'aekst' when sorted. I created an anagram dictionary as a text file in which each line constitutes an entry. Each entry has the sorted version of the anagrams' letters and then the anagrams themselves. For the example I'm using, the entry would be

    aekst skate takes
    

    Then I can just take combinations of the rack letters and do a binary search for each one in the anagram dictionary to see if there is a match. For a 7-letter rack, there is a maximum of 120 unique Scrabble-valid combinations of the letters (21 + 35 + 35 + 21 + 7 + 1 for lengths 2 through 7). Each binary search is O(log(N)), so this will be very fast.
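
    The count of 120 can be checked with a few lines of Python 3. This is only a sketch; rack_keys is a name invented here and is not part of the program in this answer.

```python
from itertools import combinations

def rack_keys(rack, min_len=2):
    """Return the distinct sorted-letter keys that a rack can produce."""
    letters = sorted(rack)  # sorting first makes every combination come out sorted
    return {
        "".join(comb)
        for size in range(min_len, len(letters) + 1)
        for comb in combinations(letters, size)
    }

print(len(rack_keys("abcdefg")))  # 120: seven distinct letters give the maximum
print(len(rack_keys("aabcdef")) < 120)  # True: repeated letters collapse some keys
```

    Collecting the keys in a set means a rack with repeated letters triggers fewer lookups than the worst case.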

    I implemented the algorithm in two parts. The first makes the anagram dictionary and the second is the actual scrabble cheating program.

    Anagram dictionary creator code

    f = open('/usr/share/dict/words')
    d = {}
    lets = set('abcdefghijklmnopqrstuvwxyz\n')
    for word in f:
      if len(set(word) - lets) == 0 and len(word) > 2 and len(word) < 9:
        word = word.strip()
        key = ''.join(sorted(word))
        if key in d:
          d[key].append(word)
        else:
          d[key] = [word]
    f.close()
    anadict = [' '.join([key]+value) for key, value in d.iteritems()]
    anadict.sort()
    f = open('anadict.txt','w')
    f.write('\n'.join(anadict))
    f.close()
    

    Scrabble cheating code

    from bisect import bisect_left
    from itertools import combinations
    from time import time
    
    def loadvars():
      f = open('anadict.txt','r')
      anadict = f.read().split('\n')
      f.close()
      return anadict
    
    scores = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2, 
             "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3, 
             "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1, 
             "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4, 
             "x": 8, "z": 10}
    
    def score_word(word):
      return sum([scores[c] for c in word])
    
    def findwords(rack, anadict):
      rack = ''.join(sorted(rack))
      foundwords = []
      for i in xrange(2,len(rack)+1):
        for comb in combinations(rack,i):
          ana = ''.join(comb)
          j = bisect_left(anadict, ana)
          if j == len(anadict):
            continue
          words = anadict[j].split()
          if words[0] == ana:
            foundwords.extend(words[1:])
      return foundwords
    
    if __name__ == "__main__":
      import sys
      if len(sys.argv) == 2:
        rack = sys.argv[1].strip()
      else:
        print """Usage: python cheat_at_scrabble.py <yourrack>"""
        exit()
      t = time()
      anadict = loadvars()
      print "Dictionary loading time:",(time()-t)
      t = time()
      foundwords = set(findwords(rack, anadict))
      scored = [(score_word(word), word) for word in foundwords]
      scored.sort()
      for score, word in scored:
        print "%d\t%s" % (score,word)
      print "Time elapsed:", (time()-t)
    

    The anagram dictionary creator takes about half a second on my machine. Once the dictionary has been created, running the Scrabble cheating program is about 15x faster than the OP's code and 5x faster than the OP's code after my aforementioned changes. Also, the start-up time of loading the dictionary is much larger than the time spent actually searching for words from a rack, so this is a much better way of doing multiple searches at once.
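
    Since loading dominates, one variation for a long-running session is to keep the anagram groups in an in-memory dict and reuse it for every rack. Note that this swaps the sorted text file and binary search for a plain hash lookup, so it is a different technique from the code above, sketched in Python 3 with made-up names (build_index, find_words) and a toy word list.

```python
from collections import defaultdict
from itertools import combinations

def build_index(words):
    """Group words by their sorted letters, e.g. 'aekst' -> ['skate', 'takes']."""
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index

def find_words(rack, index, min_len=2):
    """Look up every distinct letter combination of the rack in the index."""
    letters = sorted(rack)
    found = set()
    for size in range(min_len, len(letters) + 1):
        for comb in set(combinations(letters, size)):
            # .get avoids inserting empty lists into the defaultdict on a miss
            found.update(index.get("".join(comb), ()))
    return found

# The index is built once and then shared by every search.
index = build_index(["skate", "takes", "steak", "tea", "eat", "zebra"])
for rack in ("aekstzz", "aetxyz"):
    print(rack, sorted(find_words(rack, index)))
```

    I have not timed this against the bisect version, so treat it as an alternative to try rather than a guaranteed improvement.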

  • 2021-01-31 10:33

    You can use the fact that the /usr/share/dict/words dictionary is sorted to skip a lot of words in the dictionary without considering them at all.

    For instance, suppose a dictionary word starts with "A" and you don't have "A" in the rack. You can do a binary search on the word list for the first word that starts with "B" and skip all the words in between. This will make a big difference in most cases: you'll skip maybe half the words.
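
    A minimal Python 3 sketch of that skipping, assuming the word list is already lowercase and sorted by plain string comparison (a real system dictionary may mix in capitals and would need normalising first). The function name words_with_rack_initials is hypothetical.

```python
from bisect import bisect_left

def words_with_rack_initials(sorted_words, rack):
    """Yield only the words whose first letter is in the rack.

    Assumes sorted_words is lowercase and sorted with ordinary string ordering.
    """
    for letter in sorted(set(rack)):
        start = bisect_left(sorted_words, letter)
        # The next character in code-point order sorts after every word that
        # begins with `letter`, so it marks the end of that block.
        end = bisect_left(sorted_words, chr(ord(letter) + 1))
        yield from sorted_words[start:end]

words = ["apple", "ax", "bat", "cat", "cot", "dog", "zoo"]
print(list(words_with_rack_initials(words, "cz")))  # ['cat', 'cot', 'zoo']
```

    Each rack letter costs two binary searches, and every block of words starting with a letter outside the rack is never touched.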

  • 2021-01-31 10:42

    Organize your data better. Instead of reading through a linear dictionary and comparing, you could pre-build a tree structure with those letter count vectors (well, "vectors"), and save that to a file.
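
    One possible reading of this, sketched in Python 3: give each word a 26-entry letter count vector, and build a tree whose level i branches on the count of the i-th letter. A search then only descends into branches whose count does not exceed the rack's. All names here (count_vector, build_tree, search) and the "words" leaf key are illustrative assumptions, not something from the question.

```python
import string

ALPHABET = string.ascii_lowercase

def count_vector(word):
    """Return a 26-tuple with the number of a's, b's, ..., z's in the word."""
    return tuple(word.count(c) for c in ALPHABET)

def build_tree(words):
    """Nested dicts: level i is keyed by the count of ALPHABET[i]."""
    root = {}
    for word in words:
        node = root
        for n in count_vector(word):
            node = node.setdefault(n, {})
        node.setdefault("words", []).append(word)  # leaf reached after 26 levels
    return root

def search(node, rack_vec, depth=0):
    """Collect words whose every letter count fits within the rack's counts."""
    if depth == len(rack_vec):
        return list(node.get("words", []))
    found = []
    for n, child in node.items():
        if n <= rack_vec[depth]:  # prune branches needing more of this letter
            found.extend(search(child, rack_vec, depth + 1))
    return found

tree = build_tree(["skate", "takes", "zebra", "at"])
print(sorted(search(tree, count_vector("aekstzz"))))  # ['at', 'skate', 'takes']
```

    A structure like this is plain nested dicts and lists, so it could be saved with the standard pickle module and reloaded instead of rebuilt, though I have not measured whether that is faster in practice.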

  • 2021-01-31 10:50
    import trie
    
    
    def walk_trie(trie_node, rack, path=""):
        if trie_node.value is None:
            yield path
        for i in xrange(len(rack)):
            sub_rack = rack[:i] + rack[i+1:]
            if trie_node.nodes.has_key(rack[i]):
                for word in walk_trie(trie_node.nodes[rack[i]], sub_rack, path+rack[i]):
                    yield word
    
    
    if __name__ == "__main__":
        print "Generating trie... "
    
        # You might choose to skip words starting with a capital
        # rather than lower-casing and searching everything. Capitalised
        # words are probably proper nouns, which aren't allowed in Scrabble
    
        # I've skipped words shorter than 3 characters.
        all_words = ((line.strip().lower(), None) for line in open("/usr/share/dict/words") if len(line.strip()) >= 3)
        word_trie = trie.Trie(mapping=all_words)
        print "Walking Trie... "
        print list(walk_trie(word_trie.root, "abcdefg"))
    

    Generating the trie takes a little while, but once generated getting the list of words should be much faster than looping over a list.

    If anyone knows a way to serialise a trie that would be a great addition.
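
    On serialising: one option, if you are willing to drop the trie module, is to build the trie out of plain nested dicts, which the standard json module can dump and load directly. The sketch below is Python 3 and does not use the trie library from the code above; build_trie, the END marker and this walk_trie variant are all assumptions made for illustration.

```python
import json

END = "$"  # marker key meaning "a word ends here"; assumed never to appear in a word

def build_trie(words):
    """Build a trie as nested dicts keyed by single letters."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True
    return root

def walk_trie(node, rack, path=""):
    """Yield every word in the trie that can be spelled from the rack letters."""
    if END in node:
        yield path
    for letter in set(rack):  # set() avoids exploring a repeated letter twice
        if letter in node:
            rest = rack.replace(letter, "", 1)  # use up one copy of the letter
            yield from walk_trie(node[letter], rest, path + letter)

word_trie = build_trie(["ace", "bad", "cab", "cabbage", "face"])
restored = json.loads(json.dumps(word_trie))  # round trip through a JSON string
print(sorted(walk_trie(restored, "abcdefg")))  # ['ace', 'bad', 'cab', 'face']
```

    In a real program the JSON string would be written to a file once and read back on later runs. Whether loading it beats rebuilding the trie depends on the dictionary size, so it is worth profiling before relying on it.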

    Just to demonstrate that generating the trie is what takes the time...

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        98333    5.344    0.000    8.694    0.000 trie.py:87(__setitem__)
       832722    1.849    0.000    1.849    0.000 trie.py:10(__init__)
       832721    1.501    0.000    1.501    0.000 {method 'setdefault' of 'dict' objects}
        98334    1.005    0.000    1.730    0.000 scrabble.py:16(<genexpr>)
            1    0.491    0.491   10.915   10.915 trie.py:82(extend)
       196902    0.366    0.000    0.366    0.000 {method 'strip' of 'str' objects}
        98333    0.183    0.000    0.183    0.000 {method 'lower' of 'str' objects}
        98707    0.177    0.000    0.177    0.000 {len}
       285/33    0.003    0.000    0.004    0.000 scrabble.py:4(walk_trie)
          545    0.001    0.000    0.001    0.000 {method 'has_key' of 'dict' objects}
            1    0.001    0.001   10.921   10.921 {execfile}
            1    0.001    0.001   10.920   10.920 scrabble.py:1(<module>)
            1    0.000    0.000    0.000    0.000 trie.py:1(<module>)
            1    0.000    0.000    0.000    0.000 {open}
            1    0.000    0.000    0.000    0.000 trie.py:5(Node)
            1    0.000    0.000   10.915   10.915 trie.py:72(__init__)
            1    0.000    0.000    0.000    0.000 trie.py:33(Trie)
            1    0.000    0.000   10.921   10.921 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
            1    0.000    0.000    0.000    0.000 trie.py:1(NeedMore)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    
  • 2021-01-31 10:54

    You can convert more lists to generators:

    all( [word_count[letter] <= rack_count[letter] for letter in word] )  
    ...
    sum([score[c] for c in word])
    

    to

    all( word_count[letter] <= rack_count[letter] for letter in word ) 
    ...
    sum( score[c] for c in word )
    

    In the loop, instead of creating the rack set on every iteration, you can create it in advance, and it can be a frozenset.

    rack_set = frozenset(rack)
    scored =  ((score_word(word), word) for word in words if set(word).issubset(rack_set) and len(word) > 1 and spellable(word, rack))
    

    The same can be done with the rack_count dictionary. It doesn't need to be created on every iteration.

    rack_count  = count_letters(rack)
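
    The question's count_letters is not shown here, so the following Python 3 sketch assumes it simply tallies letters, and uses collections.Counter for that. The two-argument spellable taking a precomputed rack_count is likewise my own illustration of the idea.

```python
from collections import Counter

def count_letters(text):
    # Assumed behaviour: map each letter to how many times it occurs.
    return Counter(text)

def spellable(word, rack_count):
    word_count = count_letters(word)
    # A Counter returns 0 for letters it has not seen, so missing letters fail.
    return all(n <= rack_count[letter] for letter, n in word_count.items())

rack_count = count_letters("aekstzz")  # built once, before the loop over words

for word in ("skate", "skates", "jazz"):
    print(word, spellable(word, rack_count))
```

    Iterating over word_count.items() also checks each distinct letter only once, rather than once per occurrence in the word.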
    