q-gram approximate matching optimisations

我寻月下人不归 2021-02-03 14:29

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking t

4 answers
  •  小蘑菇 (original poster)
    2021-02-03 15:13

    I've been looking into fuzzy string matching lately, so even at the risk of answering an abandoned question, here goes. I hope you find this useful.

    I suppose you're only interested in the strings whose edit distance from the search term is below a given threshold, and that your q-grams (or n-grams) look like this:

    2-grams for "foobar": {"fo","oo","ob","ba","ar"}
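
As a minimal sketch of how such a q-gram list could be produced (the function name `qgrams` is only illustrative and does not come from the question), assuming plain unpadded strings:

```python
def qgrams(s, q=2):
    """Return the overlapping q-grams of s, in order of occurrence."""
    # A string of length n has n - q + 1 q-grams (none if n < q).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("foobar"))  # ['fo', 'oo', 'ob', 'ba', 'ar']
```

In a database setting each of these would presumably become one row linking the q-gram to the person record.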
    
    1. You could use positional q-grams:

      "foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}
      

      The positional information can be used to determine if a matching q-gram is really a "good match".

      For example, if you're searching for "foobar" with a maximum edit distance of 2, a q-gram can shift by at most 2 positions in either direction, so you're only interested in words where

      2-gram "fo" occurs at a position from 1 to 3, or
      2-gram "oo" occurs at a position from 1 to 4, or
      ... and so on
      

      String "barfoo" doesn't get any matches because the positions of the otherwise matching 2-grams differ by 3.

    2. It can also be useful to exploit the relation between edit distance and the number of matching q-grams. The intuition is that since

      a string s has len(s)-q+1 q-grams

      and

      a single edit operation can affect at most q q-grams,

      we can deduce that

      strings s1 and s2 within edit distance k of each other have at least max(len(s1),len(s2))-q+1-q*k matching non-positional q-grams.

      If you're searching for "foobar" with a maximum edit distance of 2, a matching 7-character string (such as "fotocar") should contain at least 7-2+1-2*2 = 2 common 2-grams (here "fo" and "ar").

    3. Finally, the obvious thing to do is to filter by length. The edit distance between two strings is at least the difference of their lengths. For example, if your threshold is 2 and you search for "foobar", then "foobarbar" obviously cannot match, because the lengths differ by 3.
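
The three filters above can be sketched together roughly as follows. This is only an illustration under stated assumptions: all function names are made up for the example, q-grams are unpadded, positions are 1-based, and matching q-grams are counted as a multiset intersection. The filters only discard strings that cannot be within the threshold; survivors would still need a real edit-distance check.

```python
from collections import Counter


def positional_qgrams(s, q=2):
    """Return (q-gram, 1-based position) pairs for s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def passes_length_filter(query, candidate, k):
    # Edit distance is at least the difference in lengths.
    return abs(len(query) - len(candidate)) <= k


def min_common_qgrams(query, candidate, k, q=2):
    # Lower bound from the text: max(len(s1), len(s2)) - q + 1 - q*k.
    return max(len(query), len(candidate)) - q + 1 - q * k


def passes_count_filter(query, candidate, k, q=2):
    c1 = Counter(g for g, _ in positional_qgrams(query, q))
    c2 = Counter(g for g, _ in positional_qgrams(candidate, q))
    common = sum((c1 & c2).values())  # multiset intersection
    return common >= min_common_qgrams(query, candidate, k, q)


def passes_position_filter(query, candidate, k, q=2):
    # At least one shared q-gram must sit within k positions of its
    # position in the query.
    cand = positional_qgrams(candidate, q)
    return any(g1 == g2 and abs(p1 - p2) <= k
               for g1, p1 in positional_qgrams(query, q)
               for g2, p2 in cand)


def is_candidate(query, candidate, k, q=2):
    if not passes_length_filter(query, candidate, k):
        return False
    if min_common_qgrams(query, candidate, k, q) <= 0:
        # Assumption of this sketch: for short strings the bound is
        # vacuous, so the q-gram filters cannot rule anything out.
        return True
    return (passes_count_filter(query, candidate, k, q)
            and passes_position_filter(query, candidate, k, q))


for word in ["fotocar", "barfoo", "foobarbar"]:
    print(word, is_candidate("foobar", word, 2))
# fotocar True
# barfoo False
# foobarbar False
```

Note that "barfoo" passes the count filter (it shares four 2-grams with "foobar") and is rejected only by the positional check, which is why the filters are worth combining.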

    See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more and some pseudo SQL.
