I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking t
I've been looking into fuzzy string matching lately, so even at the risk of answering to an abandoned question, here goes. Hope you find this useful.
I suppose you're only interested in the strings for which the edit distance is smaller than a given value. And your q-grams (or n-grams) look like this
2-grams for "foobar": {"fo","oo","ob","ba","ar"}
You could use positional q-grams:
"foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
The positional information can be used to determine if a matching q-gram is really a "good match".
For example, if you're searching for "foobar" with maximum edit distance of 2, this means that you're only interested in words where
2-gram "fo" exists in with position from 1 to 3 or
2-gram "oo" exists in with position from 2 to 4 or
... and so on
String "barfoo" doesn't get any matches because the positions of the otherwise matching 2-grams differ by 3.
Also, it might be useful to use the relation between edit distance and the count of matching q-grams. The intution is that since
a string s has len(s)-q+1 q-grams
and
a single edit operation can affect at most q q-grams,
we can deduce that
strings s1 and s2 within edit distance of d have at least max(len(s1),len(s2))-q+1-qk matching non-positional q-grams.
If you're searching for "foobar" with an maximum edit distance of 2, a matching 7-character string (such as "fotocar") should contain at least two common 2-grams.
See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more and some pseudo SQL.