Efficiently build a graph of words with given Hamming distance

情话喂你 2021-02-07 04:08

I want to build a graph from a list of words with a Hamming distance of (say) 1; put differently, two words are connected if they differ by only one letter (lo

4 answers
  • 2021-02-07 04:35

    There's no need to take a dependency on the alphabet size. Given a word bot, for example, insert it into a dictionary of word lists under the keys ?ot, b?t, bo?. Then, for each word list, connect all pairs.

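    import collections

    # Map each wildcard pattern (one position masked with '?') to the words matching it.
    d = collections.defaultdict(list)
    with open('/usr/share/dict/words') as f:
        for line in f:
            for word in line.split():
                if len(word) == 6:  # this demo only looks at six-letter words
                    for i in range(len(word)):
                        d[word[:i] + '?' + word[i + 1:]].append(word)
    # Words sharing a pattern differ only at the masked position; word1 < word2 keeps each pair once.
    pairs = [(word1, word2) for s in d.values() for word1 in s for word2 in s if word1 < word2]
    print(len(pairs))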
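
    To see the bucketing on a tiny input, here is a minimal sketch that turns the same idea into an adjacency map. The function name build_graph and the three-word list are illustrative assumptions, and the sketch assumes '?' never occurs inside a word.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(words):
    """Connect words that differ in exactly one position (hypothetical helper)."""
    buckets = defaultdict(list)
    for word in set(words):
        for i in range(len(word)):
            # Mask position i; words of different lengths can never share a key.
            buckets[word[:i] + "?" + word[i + 1:]].append(word)

    graph = {word: set() for word in words}
    for bucket in buckets.values():
        # Every pair inside one bucket differs only at the masked position.
        for a, b in combinations(bucket, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


graph = build_graph(["bot", "lot", "lol"])
for word in sorted(graph):
    print(word, sorted(graph[word]))
```

    Here bot and lot meet in the bucket ?ot, and lot and lol meet in lo?, so no alphabet is ever enumerated.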
    
  • 2021-02-07 04:36

    Assume you store your dictionary in a set(), so that lookup is O(1) on average (O(n) in the worst case).

    You can then generate all the valid words at Hamming distance 1 from a given word:

    >>> import string
    >>> words = {'bot', 'lol', 'lot'}
    >>> def neighbours(word):
    ...     for j in range(len(word)):
    ...         for d in string.ascii_lowercase:
    ...             word1 = ''.join(d if i==j else c for i,c in enumerate(word))
    ...             if word1 != word and word1 in words: yield word1
    ...
    >>> {word: list(neighbours(word)) for word in words}
    {'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}
    

    If N is the number of words in the dictionary, M the length of a word, and L the size of the alphabet (i.e. 26), the worst-case time complexity of finding the neighbouring words of every word with this approach is O(L*M*N).

    The time complexity of the "easy way" approach is O(N^2).

    When is this approach better? When L*M < N, i.e. when M < N/26 if only lowercase letters are considered. (I considered only the worst case here.)

    Note: the average length of an English word is 5.1 letters. Thus, you should consider this approach if your dictionary has more than about 132 words.
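
    The break-even estimate above is simple arithmetic, sketched below. The function names are made up for illustration, and the counts ignore constant factors, so a measured crossover can differ.

```python
def easy_way_checks(n):
    """Word pairs examined by the O(N^2) approach (constant factors ignored)."""
    return n * n


def neighbour_lookups(n, word_len, alphabet_size=26):
    """Candidate lookups made by the O(L*M*N) approach."""
    return n * word_len * alphabet_size


def break_even(word_len, alphabet_size=26):
    """Dictionary size N at which N^2 equals L*M*N, i.e. N = L*M."""
    return word_len * alphabet_size


print(break_even(5.1))  # roughly 132.6, the figure quoted for average English words
print(break_even(3))    # 78
```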

    Better performance is probably achievable, but this approach was really simple to implement.

    Experimental benchmark:

    The "easy way" algorithm (A1):

    from itertools import zip_longest
    def hammingdist(w1, w2):
        # zip_longest pads the shorter word, so extra trailing characters count as mismatches
        return sum(1 if c1 != c2 else 0 for c1, c2 in zip_longest(w1, w2))
    def graph1(words):
        return {word: [n for n in words if hammingdist(word, n) == 1] for word in words}
    

    This algorithm (A2):

    def graph2(words): return {word: list(neighbours(word)) for word in words}
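
    Before timing the two algorithms it is worth checking that they build the same graph. The sketch below is self-contained: it passes the word set to neighbours explicitly (a small deviation from the global used above) and uses a made-up five-word sample. It assumes all words have the same length, because A1 pads shorter words via zip_longest while A2 only generates same-length candidates.

```python
import string
from itertools import zip_longest


def hammingdist(w1, w2):
    return sum(c1 != c2 for c1, c2 in zip_longest(w1, w2))


def graph1(words):
    # A1: compare every word against every other word.
    return {w: sorted(n for n in words if hammingdist(w, n) == 1) for w in words}


def neighbours(word, words):
    # A2: generate every one-letter substitution and keep those in the dictionary.
    for j in range(len(word)):
        for d in string.ascii_lowercase:
            candidate = word[:j] + d + word[j + 1:]
            if candidate != word and candidate in words:
                yield candidate


def graph2(words):
    return {w: sorted(neighbours(w, words)) for w in words}


sample = {"bot", "lot", "lol", "cot", "cat"}  # illustrative, all length 3
assert graph1(sample) == graph2(sample)
print(graph2(sample)["cot"])
```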
    

    Benchmarking code:

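    import random
    import string
    from timeit import Timer

    for dict_size in range(100,6000,100):
        # random 3-letter words; duplicates collapse in the set, so len(words) may be below dict_size
        words = set(''.join(random.choice(string.ascii_lowercase) for x in range(3)) for _ in range(dict_size))
        t1 = Timer(lambda: graph1(words)).timeit(10)
        t2 = Timer(lambda: graph2(words)).timeit(10)
        print('%d,%f,%f' % (dict_size,t1,t2))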
    

    Output:

    100,0.119276,0.136940
    200,0.459325,0.233766
    300,0.958735,0.325848
    400,1.706914,0.446965
    500,2.744136,0.545569
    600,3.748029,0.682245
    700,5.443656,0.773449
    800,6.773326,0.874296
    900,8.535195,0.996929
    1000,10.445875,1.126241
    1100,12.510936,1.179570
    ...
    

    [Figure: plot of the benchmark data above]

    I ran another benchmark with smaller steps of N to look more closely at where the two curves cross:

    10,0.002243,0.026343
    20,0.010982,0.070572
    30,0.023949,0.073169
    40,0.035697,0.090908
    50,0.057658,0.114725
    60,0.079863,0.135462
    70,0.107428,0.159410
    80,0.142211,0.176512
    90,0.182526,0.210243
    100,0.217721,0.218544
    110,0.268710,0.256711
    120,0.334201,0.268040
    130,0.383052,0.291999
    140,0.427078,0.312975
    150,0.501833,0.338531
    160,0.637434,0.355136
    170,0.635296,0.369626
    180,0.698631,0.400146
    190,0.904568,0.444710
    200,1.024610,0.486549
    210,1.008412,0.459280
    220,1.056356,0.501408
    ...
    

    [Figure: plot of the second benchmark data above]

    You can see that the break-even point is very low (N = 100 for dictionaries of words with length=3). For small dictionaries the O(N^2) algorithm performs slightly better, but it is easily beaten by the O(LMN) algorithm as N grows.

    For dictionaries with longer words, the O(LMN) algorithm remains linear in N; it just has a different slope, so the break-even point moves slightly to the right (130 for length=5).

  • 2021-02-07 04:41

    A ternary search trie (TST) supports near-neighbor searching pretty well.

    If your dictionary is stored as a TST then, I believe, the average complexity of the lookups while building your graph would be close to O(N*log(N)) on real-world word dictionaries.

    See also the article "Efficient auto-complete with a ternary search tree".
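
    As a rough illustration of what near-neighbor search in a TST can look like, here is a minimal Python sketch restricted to same-length words (Hamming distance). It is not taken from the article mentioned above; the names Node, insert, near and build_graph are hypothetical, words are assumed non-empty, and the tree is not balanced, so the complexity estimate is not something this sketch demonstrates.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False


def insert(node, word, i=0):
    """Insert word[i:] below node and return the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node


def near(node, word, i, budget, prefix, out):
    """Collect stored words of len(word) with at most `budget` mismatches from position i on."""
    if node is None:
        return
    ch = word[i]
    # Siblings hold alternatives for the same position: with budget left explore
    # both sides, otherwise only the side where an exact match could be.
    if budget > 0 or ch < node.ch:
        near(node.lo, word, i, budget, prefix, out)
    if budget > 0 or ch > node.ch:
        near(node.hi, word, i, budget, prefix, out)
    cost = 0 if ch == node.ch else 1
    if cost <= budget:
        if i + 1 == len(word):
            if node.is_word:
                out.append(prefix + node.ch)
        else:
            near(node.eq, word, i + 1, budget - cost, prefix + node.ch, out)


def build_graph(words):
    root = None
    for w in words:
        root = insert(root, w)
    graph = {}
    for w in words:
        found = []
        near(root, w, 0, 1, "", found)
        graph[w] = sorted(n for n in found if n != w)  # drop the word itself
    return graph


print(build_graph(["bot", "lol", "lot"]))
```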

  • 2021-02-07 04:58

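    Here is a linear O(N) algorithm, but with a big constant factor (R * L * 2). R is the radix (26 for the Latin alphabet), L is the average word length, and 2 accounts for the two operations tried at each position: inserting a character and replacing one. So from abc, both aac (a replacement) and abca (an insertion) are one operation away and get connected. Note that an insertion changes the word length, so strictly speaking it goes beyond Hamming distance.

    It is written in Ruby. For 240k words it takes ~250 MB of RAM and 136 seconds on average hardware.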

    Blueprint of graph implementation

    class Node
      attr_reader :val, :edges
    
      def initialize(val)
        @val = val
        @edges = {}
      end
    
      def <<(node)
        @edges[node.val] ||= true
      end
    
      def connected?(node)
        @edges[node.val]
      end
    
      def inspect
        "Val: #{@val}, edges: #{@edges.keys * ', '}"
      end
    end
    
    class Graph
      attr_reader :vertices
      def initialize
        @vertices = {}
      end
    
      def <<(val)
        @vertices[val] = Node.new(val)
      end
    
      def connect(node1, node2)
        # print "connecting #{size} #{node1.val}, #{node2.val}\r"
        node1 << node2
        node2 << node1
      end
    
      def each
        @vertices.each do |val, node|
          yield [val, node]
        end
      end
    
      def get(val)
        @vertices[val]
      end
    end
    

    The algorithm itself

    CHARACTERS = ('a'..'z').to_a
    graph = Graph.new
    
    # ~ 240 000 words
    File.read("/usr/share/dict/words").each_line.each do |word|
      word = word.chomp
      graph << word.downcase
    end
    
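    graph.each do |val, node|
      CHARACTERS.each do |char|
        i = 0
        while i <= val.size
          # Insertion: put char before position i (the result is one letter longer).
          node2 = graph.get(val[0, i] + char + val[i..-1])
          graph.connect(node, node2) if node2
          if i < val.size
            # Replacement: substitute char for the letter at position i.
            node2 = graph.get(val[0, i] + char + val[i+1..-1])
            # Skip the case where the replacement reproduces the word itself (no self-loops).
            graph.connect(node, node2) if node2 && !node2.equal?(node)
          end
          i += 1
        end
      end
    end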
    