I have a table containing 3 million people records on which I want to perform fuzzy matching using q-grams (on surname for instance). I have created a table of 2-grams linking t
You have certainly seen the fuzzy text searches everywhere. For example you type "stck" but you actually mean "stack"! Ever wondered how does this stuff work?
There are plenty of algorithms to do fuzzy text matching, each with its own pro and cons. The most famous ones are edit distance and qgram. I want to focus on qgrams today and implement a sample.
Basically qgrams are the most suitable fuzzy string matching algorithm for relational databases. It is pretty simple. "q" in qgram will be replaced with a number like 2-gram or 3-gram or even 4-gram.
2-gram means that every word is broken into a set of two character grams. "Stack" will be broken into a set of {"st", "ta", "ac", "ck"} or "database" will be broken into {"da","at","ta","ab","ba","as","se"}.
Once the words are broken into 2-grams, we can search the database for a set of values instead of one string. For example if user mistyped "stck", any search for "stck" will not match "stack" because "a" is missing, but the 2-gram set {"st","tc","ck"} has 2 rows in common with the 2-gram set of stack! Bingo we found a pretty close match. It has nothing in common with the 2-gram set of database and only 1 in common with the 2-gram set of "stat" so we can easily suggest the user that he meant to type: first "stack" or second, "star".
Now let's implement it using Sql Server: Assume a hypothetical words dataset. You need to have a many to many relationship between 2grams and words.
CREATE TABLE Grams(twog char(2), wordId int, PRIMARY KEY (twog, wordId))
Grams table should be clustered on first twog, and then the wordId for performance. When you query a word (e.g. stack) you put the grams in a temp table. First lets create a few million dummy records.
--make millions of 2grams
DECLARE @i int =0
WHILE (@i<5000000)
BEGIN
-- a random 2gram
declare @rnum1 char = CHAR(CAST(RAND()*28 AS INT)+97)
declare @rnum2 char = CHAR(CAST(RAND()*28 AS INT)+97)
INS... INTO Grams (twog, wordId) VALUES ( @rnum1 + @rnum2, CAST(RAND()*100000 AS int))
END
Now lets query the word "stack" which will be broken to: {'st','ta','ac','ck'} two grams.
DECLARE @word TABLE(twog char(2)) -- 'stack'
INS... INTO @word VALUES ('st'), ('ta'), ('ac'), ('ck')
select wordId, count(*) from @word w inner join Grams g ON w.twog = g.twog
GROUP BY wordId
You should make sure that Sql Server uses a bunch of clustered index seeks (or loockups) for running this query. It should be the natural choice but sometimes statistics may be corrupted or out of date and SqlServer may decide that a full scan is cheaper. This usually happens if it does not know the cardinality of left side table, for example SqlServer may assume that @word table is massive and millions of loockups is going to be more expensive than a full index scan.