Efficiently build a graph of words with given Hamming distance

后端未结

关注

 4  673

I want to build a graph from a list of words with Hamming distance of (say) 1, or to put it differently, two words are connected if they only differ from one letter (lo

相关标签:

4条回答

暗喜

2021-02-07 04:35

There's no need to take a dependency on the alphabet size. Given a word bot, for example, insert it into a dictionary of word lists under the keys ?ot, b?t, bo?. Then, for each word list, connect all pairs.

import collections d = collections.defaultdict(list) with open('/usr/share/dict/words') as f: for line in f: for word in line.split(): if len(word) == 6: for i in range(len(word)): d[word[:i] + ' ' + word[i + 1:]].append(word) pairs = [(word1, word2) for s in d.values() for word1 in s for word2 in s if word1 < word2] print(len(pairs))

0 讨论(0)

发布评论:

提交评论

加载中...

被撕碎了的回忆

2021-02-07 04:36

Assuming you store your dictionary in a set(), so that lookup is O(1) in the average (worst case O(n)).

You can generate all the valid words at hamming distance 1 from a word:

>>> def neighbours(word): ... for j in range(len(word)): ... for d in string.ascii_lowercase: ... word1 = ''.join(d if i==j else c for i,c in enumerate(word)) ... if word1 != word and word1 in words: yield word1 ... >>> {word: list(neighbours(word)) for word in words} {'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}

If M is the length of a word, L the length of the alphabet (i.e. 26), the worst case time complexity of finding neighbouring words with this approach is O(L*M*N).

The time complexity of the "easy way" approach is O(N^2).

When this approach is better? When L*M < N, i.e. if considering only lowercase letters, when M < N/26. (I considered only worst case here)

Note: the average length of an english word is 5.1 letters. Thus, you should consider this approach if your dictionary size is bigger than 132 words.

Probably it is possible to achieve better performance than this. However this was really simple to implement.

Experimental benchmark:

The "easy way" algorithm (A1):

from itertools import zip_longest def hammingdist(w1,w2): return sum(1 if c1!=c2 else 0 for c1,c2 in zip_longest(w1,w2)) def graph1(words): return {word: [n for n in words if hammingdist(word,n) == 1] for word in words}

This algorithm (A2):

def graph2(words): return {word: list(neighbours(word)) for word in words}

Benchmarking code:

for dict_size in range(100,6000,100): words = set([''.join(random.choice(string.ascii_lowercase) for x in range(3)) for _ in range(dict_size)]) t1 = Timer(lambda: graph1()).timeit(10) t2 = Timer(lambda: graph2()).timeit(10) print('%d,%f,%f' % (dict_size,t1,t2))

Output:

100,0.119276,0.136940 200,0.459325,0.233766 300,0.958735,0.325848 400,1.706914,0.446965 500,2.744136,0.545569 600,3.748029,0.682245 700,5.443656,0.773449 800,6.773326,0.874296 900,8.535195,0.996929 1000,10.445875,1.126241 1100,12.510936,1.179570 ...

I ran another benchmark with smaller steps of N to see it closer:

10,0.002243,0.026343 20,0.010982,0.070572 30,0.023949,0.073169 40,0.035697,0.090908 50,0.057658,0.114725 60,0.079863,0.135462 70,0.107428,0.159410 80,0.142211,0.176512 90,0.182526,0.210243 100,0.217721,0.218544 110,0.268710,0.256711 120,0.334201,0.268040 130,0.383052,0.291999 140,0.427078,0.312975 150,0.501833,0.338531 160,0.637434,0.355136 170,0.635296,0.369626 180,0.698631,0.400146 190,0.904568,0.444710 200,1.024610,0.486549 210,1.008412,0.459280 220,1.056356,0.501408 ...

You see the tradeoff is very low (100 for dictionaries of words with length=3). For small dictionaries the O(N^2) algorithm perform slightly better, but that is easily beat by the O(LMN) algorithm as N grows.

For dictionaries with longer words, the O(LMN) algorithm remains linear in N, it just has a different slope, so the tradeoff moves slightly to the right (130 for length=5).

0 讨论(0)

发布评论:

提交评论

加载中...

离开以前

2021-02-07 04:41

Ternary Search Trie supports Near-Neighbor Searching pretty well.

If your dictionary is stored as TST then, I believe, average complexity of lookups while building your graph would be close to O(N*log(N)) on real world word dictionaries.

And check Efficient auto-complete with a ternary search tree article.

0 讨论(0)

发布评论:

提交评论

加载中...

臣服心动

2021-02-07 04:58

Here is linear O(N) algorithm, but with big constant factor (R * L * 2). R is radix (for latin alphabet it is 26). L is a medium length of word. 2 is a factor of adding/replacing wildcard character. So abc and aac and abca are two ops wich leads to hamming distance of 1.

It is written in Ruby. And for 240k words it takes ~250Mb RAM and 136 seconds on average hardware

Blueprint of graph implementation

class Node attr_reader :val, :edges def initialize(val) @val = val @edges = {} end def <<(node) @edges[node.val] ||= true end def connected?(node) @edges[node.val] end def inspect "Val: #{@val}, edges: #{@edges.keys * ', '}" end end class Graph attr_reader :vertices def initialize @vertices = {} end def <<(val) @vertices[val] = Node.new(val) end def connect(node1, node2) # print "connecting #{size} #{node1.val}, #{node2.val}\r" node1 << node2 node2 << node1 end def each @vertices.each do |val, node| yield [val, node] end end def get(val) @vertices[val] end end

The algorithm itself

CHARACTERS = ('a'..'z').to_a graph = Graph.new # ~ 240 000 words File.read("/usr/share/dict/words").each_line.each do |word| word = word.chomp graph << word.downcase end graph.each do |val, node| CHARACTERS.each do |char| i = 0 while i <= val.size node2 = graph.get(val[0, i] + char + val[i..-1]) graph.connect(node, node2) if node2 if i < val.size node2 = graph.get(val[0, i] + char + val[i+1..-1]) graph.connect(node, node2) if node2 end i += 1 end end end

0 讨论(0)

发布评论:

提交评论

加载中...

验证码

看不清?

提交回复