q-gram approximate matching optimisations

前端未结

关注

 4  2015

我寻月下人不归 2021-02-03 14:29

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking t

4条回答

小蘑菇 (楼主)

2021-02-03 15:13
I've been looking into fuzzy string matching lately, so even at the risk of answering to an abandoned question, here goes. Hope you find this useful.

I suppose you're only interested in the strings for which the edit distance is smaller than a given value. And your q-grams (or n-grams) look like this
```
2-grams for "foobar": {"fo","oo","ob","ba","ar"}
```
1. You could use positional q-grams:
```
"foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
```
  The positional information can be used to determine if a matching q-gram is really a "good match".
  
  For example, if you're searching for "foobar" with maximum edit distance of 2, this means that you're only interested in words where
```
2-gram "fo" exists in with position from 1 to 3 or
2-gram "oo" exists in with position from 2 to 4 or
... and so on
```
  String "barfoo" doesn't get any matches because the positions of the otherwise matching 2-grams differ by 3.
2. Also, it might be useful to use the relation between edit distance and the count of matching q-grams. The intution is that since
  
  a string s has len(s)-q+1 q-grams
  
  and
  
  a single edit operation can affect at most q q-grams,
  
  we can deduce that
  
  strings s1 and s2 within edit distance of d have at least max(len(s1),len(s2))-q+1-qk matching non-positional q-grams.
  
  If you're searching for "foobar" with an maximum edit distance of 2, a matching 7-character string (such as "fotocar") should contain at least two common 2-grams.
3. Finally, the obvious thing to do is to filter by lenght. The edit distance between two strings is at least the difference of the lengths of the strings. For example if your threshold is 2 and you search for "foobar", "foobarbar" cannot obviously match.
See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more and some pseudo SQL.
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...