Fast Hamming distance scoring

There is a database of N fixed-length strings and a query string q of the same length. The problem is to fetch the first k strings from the database that have the smallest Hamming distance to q.

N is small (about 400); the strings are long and of fixed length. The database doesn't change, so we can pre-compute indexes. Queries vary strongly, so caching and/or pre-computing results is not an option, and there are lots of them per second. We always need k results, even if k-1 results have match 0 (sort by Hamming distance and take the first k, so locality-sensitive hashing and similar approaches won't do). A kd-tree and similar space partitioning will probably perform worse than linear search (the strings can be very long). A BK-tree is currently the best choice, but it is still slower and more complicated than it needs to be.
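
For reference, the linear-search baseline that any index has to beat can be sketched roughly as below. This is only a minimal sketch, not code from the question; the names `hamming` and `k_nearest` are made up for illustration.

```python
import heapq


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def k_nearest(database, query, k):
    """Return the k smallest (distance, index) pairs, sorted.

    Ties are broken by database index because tuples compare
    element by element.
    """
    scored = ((hamming(s, query), i) for i, s in enumerate(database))
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    db = ["abcde", "abcdf", "zzzzz", "abzzz"]
    print(k_nearest(db, "abcde", 2))
```

The scan costs one full comparison per database string, so it always returns k results regardless of how poor the matches are.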

It feels like there should be an algorithm that builds an index and discards most of the entries in very few steps, leaving t entries (k <= t << N) for which the real Hamming distance has to be computed.

To the people suggesting fuzzy string matching based on Levenshtein distance: thanks, but the problem is much simpler. Generalized distance-metric approaches (like BK-trees) are good, but maybe there is something that exploits the facts described above (small DB, long fixed-size strings, simple Hamming distance).

Links, keywords, papers, ideas? =)

This seems like a task where a vantage-point tree (VP tree) might work. Since the Hamming distance satisfies the triangle inequality, you should be able to apply it, and it's also good for identifying the k closest entries. I've seen it in image-indexing database setups. You might check section 5 of this paper as an example of what I'm talking about (albeit in a different field).
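
To make the idea concrete, here is a rough sketch of a VP tree with a k-nearest search over Hamming distance. It is not taken from the paper mentioned above; `VPNode`, `build` and `search` are hypothetical names, the vantage point is picked at random and the split radius is the median distance, which are common choices but still assumptions here. Whether the pruning pays off for very long strings is something you would have to measure.

```python
import heapq
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # database index of the vantage point
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build(database, indices, rng):
    """Recursively build a VP tree over the given database indices."""
    indices = list(indices)
    if not indices:
        return None
    vp = indices.pop(rng.randrange(len(indices)))
    if not indices:
        return VPNode(vp, 0, None, None)
    dists = sorted((hamming(database[vp], database[i]), i) for i in indices)
    radius = dists[len(dists) // 2][0]
    inside = [i for d, i in dists if d <= radius]
    outside = [i for d, i in dists if d > radius]
    return VPNode(vp, radius,
                  build(database, inside, rng),
                  build(database, outside, rng))


def search(database, root, query, k):
    """Return the k smallest (distance, index) pairs, sorted."""
    if k <= 0:
        return []
    heap = []  # max-heap on distance, stored as (-distance, index)

    def tau():
        # Current k-th best distance, or infinity while fewer than k found.
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node is None:
            return
        d = hamming(database[node.point], query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.point))
        elif d < tau():
            heapq.heapreplace(heap, (-d, node.point))
        # Triangle inequality: a side is skipped when none of its points
        # can be strictly closer to the query than the current k-th best.
        near_first = d <= node.radius
        for side in ((0, 1) if near_first else (1, 0)):
            if side == 0 and d - tau() < node.radius:
                visit(node.inside)
            elif side == 1 and d + tau() > node.radius:
                visit(node.outside)

    visit(root)
    return sorted((-neg, i) for neg, i in heap)
```

Usage would be along the lines of `root = build(db, range(len(db)), random.Random(0))` once, then `search(db, root, q, k)` per query. When several strings tie at the k-th distance, the indices returned can differ from a plain sort, but the distances are the same.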

All Hamming distances can be produced in O(K^2/D) using the Python code below.
This is faster in some cases than the trivial code, which is O(N*K).

Where N is the number of fixed length strings
K is the length of each string
and D is the size of the dictionary.

# DATABASE is a tuple of the strings
# e.g. ('asdfjjajwi...', 'hsjsiei...', ...)

# SINGLE is the string you are matching
# e.g. 'jfjdkaks...'

SIZE_OF_STRING = 5000
NUMBER_OF_STRINGS = 400
FIRST_K_REQUIRED = 100

def setup_index():
  # index[x] maps a character to the list of strings that have it at position x
  index = []
  for x in range(SIZE_OF_STRING):
    index_dict = {}
    for y in range(NUMBER_OF_STRINGS):
      index_dict.setdefault(DATABASE[y][x], []).append(y)
    index.append(index_dict)
  return index

index = setup_index()

# Every string starts at the maximum distance; entries are [distance, string number]
output = []
for x in range(NUMBER_OF_STRINGS):
  output.append([SIZE_OF_STRING, x])

# Each position where a string matches the query lowers its distance by one
for key, c in enumerate(SINGLE):
  # .get avoids a KeyError when no string has character c at this position
  for x in index[key].get(c, []):
    output[x][0] -= 1

output.sort()
print(output[:FIRST_K_REQUIRED])

This is a faster method only when SIZE_OF_STRING / DICTIONARY_SIZE < NUMBER_OF_STRINGS.

Hope this helps.

EDIT: The complexity of the above code is incorrect.

The Hamming distances can be produced in O(N*K/D) on average, assuming the characters are spread roughly evenly over the dictionary.
This is faster in ALL cases than the trivial code, which is O(N*K).

Where N is the number of fixed length strings
K is the length of each string
and D is the size of the dictionary.
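
One way to sanity-check the N*K/D estimate is to count how many decrements the index lookup actually performs. The sketch below does that on random data; it assumes characters drawn uniformly from a 26-letter alphabet, and `build_index` and `distances_with_work` are names invented for this illustration.

```python
import random


def build_index(database):
    """index[pos][char] -> list of ids of strings having char at pos."""
    length = len(database[0])
    index = [{} for _ in range(length)]
    for sid, s in enumerate(database):
        for pos, ch in enumerate(s):
            index[pos].setdefault(ch, []).append(sid)
    return index


def distances_with_work(index, n_strings, query):
    """Return (distances, work), where work counts the decrements done."""
    dist = [len(query)] * n_strings
    work = 0
    for pos, ch in enumerate(query):
        for sid in index[pos].get(ch, ()):
            dist[sid] -= 1
            work += 1
    return dist, work


if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz"  # D = 26, an assumed alphabet
    n, k = 400, 500
    db = ["".join(rng.choice(alphabet) for _ in range(k)) for _ in range(n)]
    q = "".join(rng.choice(alphabet) for _ in range(k))
    dist, work = distances_with_work(build_index(db), n, q)
    print("decrements:", work, "estimate N*K/D:", n * k // len(alphabet))
    print("naive comparisons N*K:", n * k)
```

The number of decrements equals the total number of matching positions between the query and the database, so the gain shrinks as strings get more similar to the query. The query loop also does K dictionary lookups on top of the decrements.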

From my understanding, BK trees are great for finding all the strings at most K "differences" from the query string. This is a different question than finding the X closest elements. This is probably the reason for the performance problems.
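
To illustrate the difference, here is a small BK-tree sketch with the radius query it is designed for, plus one way of emulating "k closest" by widening the radius. The class and function names are hypothetical, and the widening loop is just one simple strategy assumed for illustration, not necessarily what the question's implementation does.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is [word, {edge_distance: child_node}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # duplicate, already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def within(self, query, radius):
        """All (distance, word) pairs with distance <= radius, sorted."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= radius:
                found.append((d, word))
            # Triangle inequality: a match can only sit under an edge
            # whose label lies in [d - radius, d + radius].
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(found)


def k_nearest_by_radius(tree, query, k, max_radius):
    """Emulate k-nearest by widening the radius until k hits are found."""
    for radius in range(max_radius + 1):
        hits = tree.within(query, radius)
        if len(hits) >= k:
            return hits[:k]
    return tree.within(query, max_radius)[:k]
```

Each widening step walks the tree again, and for a poorly matching query the radius ends up close to the string length, at which point almost nothing is pruned.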

My first inclination is that if speed is really important, then the ultimate goal should be to construct a deterministic finite automaton (DFA) to handle this problem. Donald Knuth describes a related problem and a data structure called a trie, which simulates a DFA. This method is especially nice when you have many possible words in the starting dictionary to search through. I think your problem could be an interesting extension of this work. There, the goal of the DFA is to match an input string against the words in the dictionary. I believe the same thing could be done for this problem, but returning the K closest items to the query instead. In essence we are expanding the definition of an accepting state.

Whether this is practical depends on the number of accepting states that need to be included. I think the key idea is that of compatible sets. For instance, imagine that on a number line we have the elements 1,2,3,4,5 and for any query want the two closest elements. The element 2 can be in two possible sets, (1,2) or (2,3), but 2 can never be in a set with 4 or 5. It is late, so I am not sure of the best way to construct such a DFA at the moment. Seems like there could be a decent paper in the answer.
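
The number-line example can be spelled out in a few lines. This sketch only covers the one-dimensional case from the paragraph above and assumes distinct points; `compatible_sets` and `k_nearest_on_line` are illustrative names, and it says nothing yet about how the idea would carry over to Hamming space.

```python
def compatible_sets(points, k):
    """All size-k sets that can be the k nearest of some query on a line.

    On a number line the k nearest neighbours always form a contiguous
    run of the sorted points, so there are only len(points) - k + 1 sets.
    """
    pts = sorted(points)
    return [tuple(pts[i:i + k]) for i in range(len(pts) - k + 1)]


def k_nearest_on_line(points, query, k):
    """Brute-force k nearest points to query, returned in sorted order."""
    ranked = sorted(points, key=lambda p: (abs(p - query), p))
    return tuple(sorted(ranked[:k]))


if __name__ == "__main__":
    points = [1, 2, 3, 4, 5]
    print(compatible_sets(points, 2))
    print(k_nearest_on_line(points, 2.2, 2))
```

With 5 points and k = 2 there are only 4 possible answers out of the 10 pairs one could form, which is the kind of reduction in accepting states the DFA idea would rely on.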

eris

This problem does seem strongly related to the Knuth "trie" algorithm for which there are several highly optimal special solutions - largely related to their cache coherence and CPU instruction assisted acceleration (bitwise trie).

A trie is an excellent solution for a related issue, the similarity of the beginning of the string, which of course makes it a perfect solution for finding the set of minimally unique string solutions from any point starting at the string origin. The bitwise trie in this case has an average performance of O(1) in practice and a worst case of O(m), where m is the key length. Overall its performance for search, insert and delete is the same as a hash, except it doesn't have the collision issues of a pure hashed array.
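
For readers who have not used one, a plain character trie (not the bitwise variant) can be sketched like this. The class and method names are made up for the example; lookups walk at most one node per character of the key, which is where the O(m) bound comes from.

```python
class Trie:
    def __init__(self):
        self.children = {}     # character -> child Trie node
        self.terminal = False  # True if a stored key ends at this node

    def insert(self, key):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

    def with_prefix(self, prefix):
        """All stored keys that start with prefix, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        out = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.terminal:
                out.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return sorted(out)
```

Note that the trie groups strings by shared prefix, while Hamming mismatches can occur at any position, so a shared prefix alone does not bound the Hamming distance.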

I bumped into this question because I was searching for information on bitwise tries and realized their similarity to certain Hamming algorithms, so maybe this class of algorithms would be a fruitful area of study for you. Good luck.

Source: https://stackoverflow.com/questions/3097918/fast-hamming-distance-scoring

Tags

sorting

pattern-matching

hamming-distance