q-gram approximate matching optimisations

前端 未结 4 2012
我寻月下人不归
我寻月下人不归 2021-02-03 14:29

I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking t

4条回答
  •  悲&欢浪女
    2021-02-03 15:11

    You have certainly seen the fuzzy text searches everywhere. For example you type "stck" but you actually mean "stack"! Ever wondered how does this stuff work?

    There are plenty of algorithms to do fuzzy text matching, each with its own pro and cons. The most famous ones are edit distance and qgram. I want to focus on qgrams today and implement a sample.

    Basically qgrams are the most suitable fuzzy string matching algorithm for relational databases. It is pretty simple. "q" in qgram will be replaced with a number like 2-gram or 3-gram or even 4-gram.

    2-gram means that every word is broken into a set of two character grams. "Stack" will be broken into a set of {"st", "ta", "ac", "ck"} or "database" will be broken into {"da","at","ta","ab","ba","as","se"}.

    Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example if user mistyped "stck", any search for "stck" will not match "stack" because "a" is missing, but the 2-gram set {"st","tc","ck"} has 2 rows in common with the 2-gram set of stack! Bingo we found a pretty close match. It has nothing in common with the 2-gram set of database and only 1 in common with the 2-gram set of "stat" so we can easily suggest the user that he meant to type: first "stack" or second, "star".

    Now let's implement it using Sql Server: Assume a hypothetical words dataset. You need to have a many to many relationship between 2grams and words.

    CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))
    

    Grams table should be clustered on first twog, and then the wordId for performance. When you query a word (e.g. stack) you put the grams in a temp table. First lets create a few million dummy records.

    --make millions of 2grams
     DECLARE @i int =0
     WHILE (@i<5000000)
     BEGIN
    -- a random 2gram
     declare @rnum1 char = CHAR(CAST(RAND()*28 AS INT)+97)
     declare @rnum2 char = CHAR(CAST(RAND()*28 AS INT)+97)
     INS... INTO Grams (twog, wordId) VALUES ( @rnum1 + @rnum2, CAST(RAND()*100000 AS int))
     END
    

    Now lets query the word "stack" which will be broken to: {'st','ta','ac','ck'} two grams.

    DECLARE @word TABLE(twog char(2)) -- 'stack'
     INS... INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')
    
    select wordId, count(*) from @word w inner join Grams g ON w.twog = g.twog
     GROUP BY wordId
    

    You should make sure that Sql Server uses a bunch of clustered index seeks (or loockups) for running this query. It should be the natural choice but sometimes statistics may be corrupted or out of date and SqlServer may decide that a full scan is cheaper. This usually happens if it does not know the cardinality of left side table, for example SqlServer may assume that @word table is massive and millions of loockups is going to be more expensive than a full index scan.

提交回复
热议问题