I want to build a graph from a list of words with a Hamming distance of (say) 1, or to put it differently, two words are connected if they differ by only one letter (lo
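Here is a linear O(N) algorithm, but with a big constant factor (R * L * 2). R is the radix (26 for the Latin alphabet). L is the average word length. 2 is for the two operations tried at each position: inserting a character and replacing one. So for abc, both aac (one replacement) and abca (one insertion) count as distance 1. Strictly speaking, Hamming distance is defined only for words of equal length, so the insertion case is an edit distance of 1.
It is written in Ruby. For 240k words it takes ~250 MB of RAM and 136 seconds on average hardware.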
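To make the two operations concrete, here is a minimal sketch that lists every candidate string tried for a single word. The names `ALPHABET` and `candidates` are hypothetical helpers for illustration only; they are not part of the code below.

```ruby
# Hypothetical helper (not part of the original code): lists every string
# that is one insertion or one replacement away from `word`.
ALPHABET = ('a'..'z').to_a

def candidates(word)
  result = []
  ALPHABET.each do |char|
    (0..word.size).each do |i|
      # Insertion: put `char` before position i (i == word.size appends it).
      result << word[0, i] + char + word[i..-1]
      # Replacement: overwrite the character at position i, if there is one.
      result << word[0, i] + char + word[i + 1..-1] if i < word.size
    end
  end
  # Replacing a letter with itself reproduces the word, so drop it,
  # along with duplicate insertions such as "aabc".
  result.uniq - [word]
end

neighbours = candidates('abc')
puts neighbours.include?('aac')  # replacement at position 1
puts neighbours.include?('abca') # insertion at the end
```

Each candidate costs one hash lookup in the real algorithm, which is where the R * L * 2 factor per word comes from.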
Blueprint of the graph implementation:
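class Node
  attr_reader :val, :edges

  def initialize(val)
    @val = val
    @edges = {} # neighbour value => true, used as a set
  end

  # Adds an edge from this node to the given node.
  def <<(node)
    @edges[node.val] = true
  end

  # Returns true or false rather than true or nil.
  def connected?(node)
    @edges.key?(node.val)
  end

  def inspect
    "Val: #{@val}, edges: #{@edges.keys.join(', ')}"
  end
end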
class Graph
  attr_reader :vertices

  def initialize
    @vertices = {} # word => Node
  end

  # Adds a vertex; a duplicate value keeps the existing node.
  def <<(val)
    @vertices[val] ||= Node.new(val)
  end

  # Adds an undirected edge between two nodes.
  def connect(node1, node2)
    # print "connecting #{size} #{node1.val}, #{node2.val}\r"
    node1 << node2
    node2 << node1
  end

  def each
    @vertices.each do |val, node|
      yield [val, node]
    end
  end

  def get(val)
    @vertices[val]
  end
end
The algorithm itself
CHARACTERS = ('a'..'z').to_a
graph = Graph.new
# ~ 240 000 words
File.foreach("/usr/share/dict/words") do |word|
  graph << word.chomp.downcase
end
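graph.each do |val, node|
  CHARACTERS.each do |char|
    (0..val.size).each do |i|
      # Insertion: put char before position i (i == val.size appends it).
      node2 = graph.get(val[0, i] + char + val[i..-1])
      graph.connect(node, node2) if node2
      next if i == val.size
      # Replacement: overwrite the character at position i.
      node2 = graph.get(val[0, i] + char + val[i + 1..-1])
      # Replacing a letter with itself yields the same word; skip that self-loop.
      graph.connect(node, node2) if node2 && !node2.equal?(node)
    end
  end
end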
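If only same-length neighbours are needed (Hamming distance exactly 1, no insertions), a different technique avoids the factor R: group words by patterns in which one position is replaced with a wildcard. The sketch below is a variant, not the algorithm above. The helper name `hamming_neighbors` is hypothetical, and it assumes that `'*'` never occurs in a word.

```ruby
require 'set'

# Hypothetical helper: returns a hash mapping each word to the set of words
# at Hamming distance exactly 1. Assumes '*' never appears in the input.
def hamming_neighbors(words)
  # pattern with one wildcard => all words matching that pattern
  buckets = Hash.new { |hash, key| hash[key] = [] }
  words.uniq.each do |word|
    word.size.times do |i|
      buckets[word[0, i] + '*' + word[i + 1..-1]] << word
    end
  end

  edges = Hash.new { |hash, key| hash[key] = Set.new }
  buckets.each_value do |group|
    # Distinct words sharing a pattern differ only at the wildcard position.
    group.combination(2) do |a, b|
      edges[a] << b
      edges[b] << a
    end
  end
  edges
end

edges = hamming_neighbors(%w[abc aac abd abca xyz])
p edges['abc'].to_a.sort
p edges.key?('abca')
```

Here abc is linked to aac and abd, while abca gets no edges because it has a different length. Each word produces L patterns instead of R * L * 2 lookups; the pairing step costs time proportional to the number of edges produced.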