How can this Python Scrabble word finder be made faster?

梦如初夏 2021-01-31 10:08

I have no real need to improve it; it's just for fun. Right now it takes about a second on a list of about 200K words.

I've tried to optimize it as much as I know how.

5 answers
  •  孤城傲影
    2021-01-31 10:33

    Without going too far from your basic code, here are some fairly simple optimizations:

    First, change your word reader to be:

    def word_reader(filename, L):
      L2 = L+2
      # returns an iterator
      return (word.strip() for word in open(filename) \
              if len(word) < L2 and len(word) > 2)
    

    and call it as

    words = word_reader('/usr/share/dict/words', len(rack))
    

    This gives the biggest improvement of all my suggested changes, because it eliminates words that are too long or too short before any further work is done. Note that word has not yet been stripped of its newline character when the lengths are compared, so each bound is one larger than it would be for a stripped word. I assumed '\n' line separators. There could also be a problem with the last word in the list, because it probably won't have a newline at the end of it, but on my computer the last word is études, which won't be found with our method anyway. Of course, you could just create your own dictionary beforehand from the original, removing the words that aren't valid: those that aren't the right length or that have letters outside of a-z.
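
    As a minimal sketch of the same length filter in Python 3, assuming one word per line: stripping each line before measuring it removes the dependence on '\n' separators and on whether the last line ends with a newline. The name read_candidate_words and the min_len parameter are illustrative and not part of the original code.

```python
def read_candidate_words(lines, rack_len, min_len=2):
    """Yield stripped words whose length fits the rack.

    `lines` is any iterable of text lines (an open file works).
    """
    for line in lines:
        word = line.strip()
        if min_len <= len(word) <= rack_len:
            yield word


# An in-memory list stands in for the word file; the last line has no newline.
sample_lines = ["a\n", "at\n", "skate\n", "takes\n", "anagrams\n", "tea"]
candidates = list(read_candidate_words(sample_lines, rack_len=5))
print(candidates)  # ['at', 'skate', 'takes', 'tea']
```

    With a real file, the call would look like read_candidate_words(open(path), len(rack)), where path is whatever word list is being used.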

    Next, Ferran suggested a variable for the rack set, which is a good idea. However, you are also getting a pretty major slowdown from making a set out of every word. The purpose of using the sets at all was to weed out the many words that have no chance of matching and thereby give a speed-up. However, I found it was even faster to just check whether the first letter of the word is in the rack before calling spellable:

    rackset = frozenset(rack)
    scored =  [(score_word(word), word) for word in words if word[0] in rackset \
               and spellable(word, rack)]
    

    However, this has to be accompanied by a change to spellable. I changed it to the following:

    def spellable(word, rack):
        return all( [rack.count(letter) >= word.count(letter) \
                     for letter in set(word)] )
    

    which, even without the changes in the previous step, is faster than what you currently have.

    With the three changes above, the code was about 3x faster in my simple tests.
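
    For comparison, the same letter-count check can be written with collections.Counter, which tallies each word once instead of calling count for every letter. This is an alternative sketch, not the code that was timed above, and whether it is faster would need measuring. The name spellable_counter is illustrative.

```python
from collections import Counter


def spellable_counter(word, rack_counts):
    """Return True if the rack holds every letter of `word` often enough."""
    word_counts = Counter(word)
    # A Counter returns 0 for letters that are missing from the rack.
    return all(rack_counts[letter] >= n for letter, n in word_counts.items())


rack_counts = Counter("aekstzz")  # count the rack once, reuse for every word
print(spellable_counter("skate", rack_counts))  # True
print(spellable_counter("tasks", rack_counts))  # False: needs two 's'
```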

    On to a better algorithm

    Since what you are really doing is looking for anagrams, it makes sense to use an anagram dictionary. An anagram dictionary takes each word in a dictionary and groups the words that are anagrams of each other. For instance, 'takes' and 'skate' are anagrams of each other because they are both equal to 'aekst' when sorted. I created an anagram dictionary as a text file in which each line constitutes an entry. Each entry has the sorted letters shared by the anagrams, followed by the anagrams themselves. For the example I'm using, the entry would be

    aekst skate takes
    

    Then I can just take combinations of the rack letters and do a binary search for each one in the anagram dictionary to see if there is a match. For a 7-letter rack, there are at most 120 unique Scrabble-valid combinations of the letters (subsets of two to seven letters: 21 + 35 + 35 + 21 + 7 + 1). Each binary search is O(log(N)) in the number of entries, so this will be very fast.
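
    The count of 120 can be checked directly. The sketch below, in Python 3.8 or later (for math.comb), counts the subsets by formula and by enumeration, and also shows that a rack with a repeated letter yields fewer distinct keys. The example racks are made up for illustration.

```python
from itertools import combinations
from math import comb

rack = "abcdefg"  # seven distinct letters, the worst case
sizes = range(2, len(rack) + 1)

# Count the subsets of size 2 through 7 in two ways.
by_formula = sum(comb(len(rack), size) for size in sizes)
by_enumeration = sum(1 for size in sizes for _ in combinations(rack, size))
print(by_formula, by_enumeration)  # 120 120

# A repeated letter produces duplicate combinations, so fewer distinct keys.
dup_rack = sorted("aabcdef")
unique_keys = {
    "".join(combo)
    for size in range(2, len(dup_rack) + 1)
    for combo in combinations(dup_rack, size)
}
print(len(unique_keys))
```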

    I implemented the algorithm in two parts. The first makes the anagram dictionary and the second is the actual scrabble cheating program.

    Anagram dictionary creator code

    # Build a sorted anagram dictionary file from the system word list.
    lets = set('abcdefghijklmnopqrstuvwxyz\n')
    d = {}
    with open('/usr/share/dict/words') as f:
      for word in f:
        # word still ends with '\n' here, so these bounds keep 2- to 7-letter words
        if len(set(word) - lets) == 0 and len(word) > 2 and len(word) < 9:
          word = word.strip()
          key = ''.join(sorted(word))
          # group the word under its sorted letters
          d.setdefault(key, []).append(word)
    # each entry is the key followed by its anagrams
    anadict = [' '.join([key] + value) for key, value in d.items()]
    anadict.sort()
    with open('anadict.txt', 'w') as f:
      f.write('\n'.join(anadict))
    

    Scrabble cheating code

    from bisect import bisect_left
    from itertools import combinations
    from time import time
    
    def loadvars():
      f = open('anadict.txt','r')
      anadict = f.read().split('\n')
      f.close()
      return anadict
    
    scores = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2, 
             "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3, 
             "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1, 
             "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4, 
             "x": 8, "z": 10}
    
    def score_word(word):
      return sum([scores[c] for c in word])
    
    def findwords(rack, anadict):
      rack = ''.join(sorted(rack))
      foundwords = []
      # try every subset of two or more rack letters
      for i in range(2, len(rack)+1):
        for comb in combinations(rack, i):
          ana = ''.join(comb)
          # entries are sorted, so the first entry >= ana is the only candidate
          j = bisect_left(anadict, ana)
          if j == len(anadict):
            continue
          words = anadict[j].split()
          if words[0] == ana:
            foundwords.extend(words[1:])
      return foundwords
    
    if __name__ == "__main__":
      import sys
      if len(sys.argv) == 2:
        rack = sys.argv[1].strip()
      else:
        print("""Usage: python cheat_at_scrabble.py """)
        sys.exit()
      t = time()
      anadict = loadvars()
      print("Dictionary loading time:", time()-t)
      t = time()
      # a set removes words found more than once when the rack repeats a letter
      foundwords = set(findwords(rack, anadict))
      scored = [(score_word(word), word) for word in foundwords]
      scored.sort()
      for score, word in scored:
        print("%d\t%s" % (score, word))
      print("Time elapsed:", time()-t)
    

    The anagram dictionary creator takes about half a second on my machine. Once the dictionary has been created, running the scrabble cheating program is about 15x faster than the OP's code and 5x faster than the OP's code after my aforementioned changes. Also, loading the dictionary at start-up takes much longer than actually searching for words from a rack, so this approach pays off even more when doing multiple searches in one run.
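
    The point about multiple searches can be illustrated by loading the entries once and then querying several racks. This is a Python 3 sketch that assumes the entries are sorted and formatted as described above. The three-entry list is a made-up stand-in for the contents of anadict.txt, and find_words is a lightly adapted variant of findwords that returns a set.

```python
from bisect import bisect_left
from itertools import combinations


def find_words(rack, anadict):
    """Return the set of words spellable from `rack`.

    `anadict` is a sorted list of 'key word word ...' entries.
    """
    letters = sorted(rack)
    found = set()
    for size in range(2, len(letters) + 1):
        # set() drops duplicate combinations caused by repeated letters
        for combo in set(combinations(letters, size)):
            key = "".join(combo)
            j = bisect_left(anadict, key)
            if j < len(anadict):
                parts = anadict[j].split()
                if parts[0] == key:
                    found.update(parts[1:])
    return found


# Stand-in for the loaded dictionary; in practice this is read once from disk.
anadict = ["act cat", "aekst skate takes", "aet ate eat tea"]

for rack in ["skate", "tca", "zzz"]:
    print(rack, sorted(find_words(rack, anadict)))
```

    Each call reuses the same in-memory list, so only the first step (reading the file) pays the loading cost.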
