I want to build a graph from a list of words with a Hamming distance of (say) 1, or to put it differently, two words are connected if they differ by only one letter (lo
Assuming you store your dictionary in a set(), so that lookup is O(1) on average (O(n) in the worst case), you can generate all the valid words at Hamming distance 1 from a word:
>>> import string
>>> words = {'lol', 'lot', 'bot'}
>>> def neighbours(word):
...     for j in range(len(word)):
...         for d in string.ascii_lowercase:
...             word1 = ''.join(d if i == j else c for i, c in enumerate(word))
...             if word1 != word and word1 in words:
...                 yield word1
...
>>> {word: list(neighbours(word)) for word in words}
{'bot': ['lot'], 'lol': ['lot'], 'lot': ['bot', 'lol']}
If M is the length of a word, L the size of the alphabet (i.e. 26) and N the number of words in the dictionary, the worst-case time complexity of finding the neighbours of all words with this approach is O(L*M*N).
The time complexity of the "easy way" approach is O(N^2).
When is this approach better? When L*M < N, i.e., considering only lowercase letters, when M < N/26. (I considered only the worst case here.)
Note: the average length of an English word is 5.1 letters, so you should consider this approach if your dictionary is bigger than about 132 words.
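To make that rule of thumb concrete, here is a tiny helper (the name prefer_neighbours, its signature and the default alphabet size are my own choices, purely for illustration):

def prefer_neighbours(word_length, dict_size, alphabet_size=26):
    # True if the O(L*M*N) neighbour-generation approach is expected to
    # beat the O(N^2) pairwise approach (worst-case reasoning only).
    return alphabet_size * word_length < dict_size

print(prefer_neighbours(5.1, 132))  # False: 26 * 5.1 = 132.6, so 132 words is not quite enough
print(prefer_neighbours(5.1, 200))  # True: the dictionary is big enough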
It is probably possible to achieve better performance than this, but this approach was really simple to implement.
The "easy way" algorithm (A1):
from itertools import zip_longest

def hammingdist(w1, w2):
    # Count the positions where the two words differ (padding with None if lengths differ)
    return sum(1 if c1 != c2 else 0 for c1, c2 in zip_longest(w1, w2))

def graph1(words):
    return {word: [n for n in words if hammingdist(word, n) == 1] for word in words}
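As a quick sanity check of hammingdist, using the three example words from above:

>>> hammingdist('lot', 'bot')
1
>>> hammingdist('lot', 'lol')
1
>>> hammingdist('bot', 'lol')
2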
This algorithm (A2):
def graph2(words):
    return {word: list(neighbours(word)) for word in words}
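And a quick check that the two algorithms agree on the small example set (the neighbour lists are compared as sets, since the two implementations may yield them in a different order):

>>> words = {'lol', 'lot', 'bot'}
>>> {w: set(ns) for w, ns in graph1(words).items()} == {w: set(ns) for w, ns in graph2(words).items()}
True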
Benchmarking code:
import random
import string
from timeit import Timer

for dict_size in range(100, 6000, 100):
    words = set(''.join(random.choice(string.ascii_lowercase) for _ in range(3))
                for _ in range(dict_size))
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))
Output:
100,0.119276,0.136940
200,0.459325,0.233766
300,0.958735,0.325848
400,1.706914,0.446965
500,2.744136,0.545569
600,3.748029,0.682245
700,5.443656,0.773449
800,6.773326,0.874296
900,8.535195,0.996929
1000,10.445875,1.126241
1100,12.510936,1.179570
...
I ran another benchmark with smaller steps of N to look at the crossover more closely:
10,0.002243,0.026343
20,0.010982,0.070572
30,0.023949,0.073169
40,0.035697,0.090908
50,0.057658,0.114725
60,0.079863,0.135462
70,0.107428,0.159410
80,0.142211,0.176512
90,0.182526,0.210243
100,0.217721,0.218544
110,0.268710,0.256711
120,0.334201,0.268040
130,0.383052,0.291999
140,0.427078,0.312975
150,0.501833,0.338531
160,0.637434,0.355136
170,0.635296,0.369626
180,0.698631,0.400146
190,0.904568,0.444710
200,1.024610,0.486549
210,1.008412,0.459280
220,1.056356,0.501408
...
You can see that the crossover point is very low (around N=100 for dictionaries of words with length 3). For small dictionaries the O(N^2) algorithm performs slightly better, but it is easily beaten by the O(L*M*N) algorithm as N grows.
For dictionaries with longer words, the O(L*M*N) algorithm remains linear in N; it just has a different slope, so the crossover point moves slightly to the right (around N=130 for length 5).
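If you want to reproduce the length-5 measurement, the only change to the benchmark above is the length of the randomly generated words (the imports and the two graph functions are reused as-is); a minimal sketch:

for dict_size in range(100, 6000, 100):
    words = set(''.join(random.choice(string.ascii_lowercase) for _ in range(5))
                for _ in range(dict_size))
    t1 = Timer(lambda: graph1(words)).timeit(10)
    t2 = Timer(lambda: graph2(words)).timeit(10)
    print('%d,%f,%f' % (dict_size, t1, t2))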