Question
I am looking for an algorithm and storage schema to do string matching over a bigger-than-memory dictionary.
My initial attempt, inspired by https://swtch.com/~rsc/regexp/regexp4.html, was to store, at index time, the trigrams of every word of the dictionary; for instance the word apple is split into $ap, app, ppl, ple, and le$. Each of those trigrams is associated with the word it came from.
Then at query time, I do the same for the input string that must be matched: I look up each of its trigrams in the database and collect the candidate words in a mapping that associates each candidate with the number of trigrams it shares with the query. Then I compute the levenshtein distance between the query and every candidate and apply the following formula:
score(query, candidate) = common_trigram_number(query, candidate) - abs(levenshtein(query, candidate))
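To make this concrete, here is a minimal in-memory sketch of the trigram scheme and scoring formula; the trigrams/index/search helpers are illustrative names of my own, and a plain dict stands in for the on-disk store.

from collections import defaultdict

def trigrams(word):
    # Pad with $ boundary markers, then slide a window of size 3.
    padded = "$" + word + "$"
    return {padded[i:i+3] for i in range(len(padded) - 2)}

def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def index(words):
    # Inverted index: trigram -> set of words containing it.
    inverted = defaultdict(set)
    for word in words:
        for gram in trigrams(word):
            inverted[gram].add(word)
    return inverted

def search(inverted, query):
    # Count shared trigrams per candidate, then apply the score formula above.
    counts = defaultdict(int)
    for gram in trigrams(query):
        for word in inverted.get(gram, ()):
            counts[word] += 1
    scored = [(common - abs(levenshtein(query, word)), word)
              for word, common in counts.items()]
    return sorted(scored, reverse=True)

inverted = index(["apple", "apply", "ample", "banana"])
print(search(inverted, "aple"))

This also illustrates why the candidate set is too broad: any word sharing a single trigram with the query is pulled in and must be scored with levenshtein.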
There are two problems with this approach: first, the candidate selection is too broad; second, the levenshtein distance is too slow to compute.
Fixing the first could make optimizing the second unnecessary.
I thought about another approach: at index time, instead of storing trigrams, I would store the words themselves (possibly associated with a frequency). At query time, I could look up successive prefixes of the query string and score the candidates using levenshtein distance and frequency.
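A minimal sketch of that prefix idea, using a sorted Python list as a stand-in for the ordered key-value store (the bisect call would become a cursor positioned at the prefix); the function name and word list are only illustrative.

import bisect

WORDS = sorted(["ample", "apple", "appliance", "apply", "banana"])

def candidates_by_prefix(words, query, limit=10):
    seen = []
    # Try the longest prefix first; shorter prefixes widen the net.
    for size in range(len(query), 0, -1):
        prefix = query[:size]
        start = bisect.bisect_left(words, prefix)
        for word in words[start:start + limit]:
            if not word.startswith(prefix):
                break
            if word not in seen:
                seen.append(word)
        if len(seen) >= limit:
            break
    return seen[:limit]

print(candidates_by_prefix(WORDS, "aplle"))

The returned candidates would then be re-ranked by levenshtein distance and frequency as described above.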
In particular, I am not looking for an algorithm that returns all strings at a distance of 1, 2, etc. I would just like a paginated list of more-or-less relevant words from the dictionary; the actual selection is made by the user.
Also, it must be possible to represent the index in terms of an ordered key-value store like RocksDB or WiredTiger.
Answer 1:
simhash captures similarity between (small) strings, but it does not really solve the problem of querying for the most similar strings in a bigger-than-RAM dataset. I think the original paper recommends indexing some permutations of the hash; that requires a lot of memory and does not take advantage of the ordered nature of an OKVS.
I think I found a hash that captures similarity in its prefix:
In [1]: import fuzz
In [2]: hello = fuzz.bbkh("hello")
In [3]: helo = fuzz.bbkh("helo")
In [4]: hellooo = fuzz.bbkh("hellooo")
In [5]: salut = fuzz.bbkh("salut")
In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213
In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12
In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0
After a small test over wikidata labels, it seems to give good results:
$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france 0
** farrance -2
** freande -2
** defrance -2
real 0m0.054s
$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace -1
** france -2
** fernacre -2
real 0m0.060s
$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium -2
real 0m0.047s
$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium 0
** ajbelgium -2
real 0m0.059s
$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium -1
** beijum -2
real 0m0.047s
Here is an implementation:
from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

chars = ascii_lowercase + "$"
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])


def ngram(string, n):
    return [string[i:i+n] for i in range(len(string)-n+1)]


def integer2booleans(integer):
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def merkletree(booleans):
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        # Copy the current level into the output, then build the next
        # level bottom-up by OR-ing pairs of bits.
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    return out


def bbkh(string):
    # One-hot encode the bigrams of the string (with $ as boundary marker)
    # into a HASH_SIZE-bit integer, then serialize its merkle tree so that
    # similar strings share a long prefix of the resulting bytes.
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash
def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
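To tie this back to an ordered key-value store: because similar strings share a long prefix of their bbkh, the hash can be used directly as the key (or key prefix), and a short range scan around bbkh(query) yields nearby candidates. Below is a hypothetical sketch (not the fuzz.py tool shown above) reusing bbkh and BBKH_LENGTH, with a sorted in-memory list and bisect standing in for RocksDB/WiredTiger cursors; build and nearby are names I made up for illustration.

import bisect

def build(words):
    # Key layout: bbkh(word) followed by the word itself.
    return sorted(bbkh(word) + word.encode() for word in words)

def nearby(keys, query, width=5):
    # Position a "cursor" at bbkh(query) and scan a small window around it.
    target = bbkh(query)
    position = bisect.bisect_left(keys, target)
    start = max(0, position - width)
    return [key[BBKH_LENGTH:].decode() for key in keys[start:position + width]]

keys = build(["france", "farrance", "belgium", "germany", "spain"])
print(nearby(keys, "frnace"))

The candidates returned by the scan can then be ranked, for example by levenshtein distance, before being paginated to the user.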
Note: computing a simhash of the input string only works well over a bag of lemmas or stems; it is really meant to find similar documents.
Source: https://stackoverflow.com/questions/58065020/how-to-do-fuzzy-string-matching-of-bigger-than-memory-dictionary-in-an-ordered-k