q-gram approximate matching optimisations

我寻月下人不归 2021-02-03 14:29

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking t

4 answers
  •  忘了有多久
    2021-02-03 14:55

    I have a simple improvement that will not eliminate the scan but will speed it up if you use only 2-grams or 3-grams: replace the letters with numbers. Most SQL engines compare numbers much faster than they compare strings.

    Example: our source table contains text entries in one column. We create a temp table in which we split the names into 2-grams, using a query such as:

    -- `column` stands for the name column of the source table
    SELECT SUBSTRING(column, 1, 2) AS gram, 1 AS position FROM sourcetable
    UNION
    SELECT SUBSTRING(column, 2, 2) AS gram, 2 AS position FROM sourcetable
    UNION
    SELECT SUBSTRING(column, 3, 2) AS gram, 3 AS position FROM sourcetable
    -- etc., one SELECT per start position

    This should be generated in a loop over the start position, running from the first character up to the maximum size of a source entry.
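    As a rough sketch of that loop, the positions can also be generated inside the query instead of writing one SELECT per position by hand. The example below uses Python's built-in sqlite3 module, so the SQL is SQLite dialect (substr, length) rather than the T-SQL shown above. The sample names and the id column are assumptions added here so that each gram can be traced back to its record.

```python
import sqlite3

# In-memory database with a few made-up names (sample data is an assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourcetable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO sourcetable (name) VALUES (?)",
    [("smith",), ("smyth",), ("jones",)],
)

# The recursive CTE plays the role of the loop counter:
# it yields start positions 1 .. (longest name - 1).
grams = conn.execute(
    """
    WITH RECURSIVE positions(p) AS (
        SELECT 1
        UNION ALL
        SELECT p + 1 FROM positions
        WHERE p + 1 < (SELECT MAX(length(name)) FROM sourcetable)
    )
    SELECT s.id, substr(s.name, p, 2) AS gram, p AS position
    FROM sourcetable AS s
    JOIN positions ON p < length(s.name)  -- keep only full 2-grams
    ORDER BY s.id, p
    """
).fetchall()

for row in grams[:4]:
    print(row)
```

    Joining on p < length(name) avoids emitting empty or one-character grams for names shorter than the longest entry.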

    Then we prepare a mapping table that contains all possible 2-letter grams and includes an IDENTITY (1,1) column called gram_id. We may sort the grams by their frequency in the English dictionary and eliminate the most infrequent ones (like 'kk' or 'wq'). This sorting may take some time and research, but it assigns the smallest numbers to the most frequent grams. If we can limit the number of grams to 255, performance improves further, because we can then use a tinyint column for the gram_id.
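    A minimal sketch of such a mapping, in Python. Note one substitution: it ranks grams by how often they occur in the data itself rather than in an English dictionary, which is an assumption made to keep the example self-contained. The function name build_gram_ids and the sample names are hypothetical.

```python
from collections import Counter

def build_gram_ids(names, max_ids=255):
    """Map each 2-gram to a small integer, most frequent gram first.

    Ids start at 1 (like IDENTITY (1,1)) and are capped at max_ids so
    that they would fit a tinyint column; rarer grams are dropped.
    """
    counts = Counter(
        name[i:i + 2] for name in names for i in range(len(name) - 1)
    )
    # Descending frequency, then alphabetical order to break ties
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: gram_id for gram_id, gram in enumerate(ranked[:max_ids], start=1)}

gram_ids = build_gram_ids(["smith", "smyth", "smithson"])
print(gram_ids["sm"], gram_ids["th"])  # the two most frequent grams get ids 1 and 2
```

    Grams that fall outside the 255 most frequent ones get no id, so they simply cannot contribute to a match; whether that loss is acceptable depends on the data.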

    Then we build another temp table from the first one, using the gram_id instead of the gram. That becomes the master table. We create an index on the gram_id column and on the position column.

    Then, when we have to compare a text string to the master table, we first split it into 2-grams, replace the 2-grams with their gram_id (using the mapping table), and compare them with those in the master table.

    That makes a lot of comparisons, but most of them are on 2-digit integers, which are very quick to compare.
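    To make the lookup step concrete, here is a small end-to-end sketch using Python's sqlite3 module (SQLite dialect, not T-SQL). Several things are assumptions rather than part of the answer above: the record_id column, the sample names, the helper names two_grams and candidates, the single composite index, and the scoring, which just counts distinct shared grams per record and ignores position.

```python
import sqlite3

def two_grams(text):
    """Return the overlapping 2-grams of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

names = {1: "smith", 2: "smyth", 3: "jones"}  # made-up sample records

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (gram_id INTEGER PRIMARY KEY, gram TEXT UNIQUE)")
conn.execute("CREATE TABLE master (record_id INTEGER, gram_id INTEGER, position INTEGER)")
# Index on gram_id and position, modeled here as one composite index
conn.execute("CREATE INDEX ix_master_gram ON master (gram_id, position)")

# Build the mapping and the integer-only master table
for record_id, name in names.items():
    for position, gram in enumerate(two_grams(name), start=1):
        conn.execute("INSERT OR IGNORE INTO mapping (gram) VALUES (?)", (gram,))
        gram_id = conn.execute(
            "SELECT gram_id FROM mapping WHERE gram = ?", (gram,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO master VALUES (?, ?, ?)", (record_id, gram_id, position)
        )

def candidates(query):
    """Return (record_id, shared_gram_count) pairs, best match first."""
    ids = []
    for gram in two_grams(query):
        row = conn.execute(
            "SELECT gram_id FROM mapping WHERE gram = ?", (gram,)
        ).fetchone()
        if row:  # grams unknown to the mapping cannot match anything
            ids.append(row[0])
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return conn.execute(
        f"""
        SELECT record_id, COUNT(DISTINCT gram_id) AS shared
        FROM master
        WHERE gram_id IN ({placeholders})
        GROUP BY record_id
        ORDER BY shared DESC, record_id
        """,
        ids,
    ).fetchall()

matches = candidates("smithe")
print(matches)
```

    Only integers are compared against the master table; the text-to-id translation happens once per query gram against the much smaller mapping table.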
