SQL Fuzzy Join - MSSQL

Posted by 梦想的初衷 on 2019-12-03 20:57:25

Here is how this could be done using Levenshtein Distance:

Create this function first:

CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1,
  @c = LEN(@s2) -- result when @s1 is empty; overwritten inside the loop otherwise

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

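  WHILE @j <= @s2_len
  BEGIN
   -- insertion: cost of the cell to the left, plus 1
   SET @c = @c + 1
   -- substitution: diagonal cell of the previous row, plus 1 if the characters differ
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   -- deletion: cell directly above in the previous row, plus 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
  END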

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END

(Function developed by Joseph Gama)
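A quick sanity check can confirm the function was created correctly. This is a minimal sketch, assuming the function ended up in the dbo schema (the usual default):

```sql
-- Classic textbook pair: the edit distance between 'kitten' and 'sitting' is 3
-- (k -> s, e -> i, append g).
SELECT dbo.ufn_levenshtein(N'kitten', N'sitting') AS distance;

-- Identical strings are at distance 0.
SELECT dbo.ufn_levenshtein(N'Candy Place', N'Candy Place') AS distance;
```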

Then use this query to get the matches:

SELECT a.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(a.Customer, ' ', ''), REPLACE(b.Customer, ' ', '')) < 5;
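Because the join keeps every existing customer within the threshold, one potential customer can come back with several rows. If only the closest candidate is wanted, something like the following might work. It is a sketch only, assuming the same temp tables and the same threshold of 5; the column aliases and the tie-break on ID are illustrative choices, not part of the original answer:

```sql
SELECT a.Customer AS PotentialCustomer,
       m.ID,
       m.Customer AS ExistingCustomer,
       m.Distance
FROM #POTENTIALCUSTOMERS a
     OUTER APPLY
     (
         -- Best candidate for this potential customer, if any is close enough.
         SELECT TOP (1) b.ID, b.Customer, d.Distance
         FROM #ExistingCustomers b
              CROSS APPLY (SELECT dbo.ufn_levenshtein(REPLACE(a.Customer, ' ', ''),
                                                      REPLACE(b.Customer, ' ', '')) AS Distance) d
         WHERE d.Distance < 5
         ORDER BY d.Distance, b.ID   -- smallest distance first; ID breaks ties
     ) m;
```

OUTER APPLY keeps potential customers that have no candidate at all, which mirrors the LEFT JOIN in the original query.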

Complete script, to run after you have created the function:

IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
    DROP TABLE #ExistingCustomers;

CREATE TABLE #ExistingCustomers
(Customer VARCHAR(255),
 ID       INT
);

INSERT INTO #ExistingCustomers
VALUES
('Ed''s Barbershop',
 1002
);

INSERT INTO #ExistingCustomers
VALUES
('GroceryTown',
 1003
);

INSERT INTO #ExistingCustomers
VALUES
('Candy Place',
 1004
);

INSERT INTO #ExistingCustomers
VALUES
('Handy Man',
 1005
);

IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
    DROP TABLE #POTENTIALCUSTOMERS;

CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Eds Barbershop');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Grocery Town');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Candy Place');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Handee Man');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Beauty Salon');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('The Apple Farm');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Igloo Ice Cream');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Ride-a-Long Bikes');

SELECT a.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(a.Customer, ' ', ''), REPLACE(b.Customer, ' ', '')) < 5;

Here you can find a T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx

Trying to do this within SQL is going to be a continual challenge, and one that you are not likely to win. You can go quite far by stripping out characters other than a-z and 0-9, or by trying something like Soundex or Metaphone matching or Levenshtein distance, but there will always be another edge case that all your replacing, wildcarding, phoneticising or plain fudging did not pick up.
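Soundex is built into SQL Server through the SOUNDEX and DIFFERENCE functions, so a phonetic comparison can be tried without any custom function. This is a minimal sketch, assuming the #POTENTIALCUSTOMERS and #ExistingCustomers temp tables from the script above; requiring a DIFFERENCE of 4 is an assumption to tune, not a recommendation:

```sql
-- SOUNDEX returns a four-character phonetic code. DIFFERENCE compares the
-- SOUNDEX codes of two strings and returns 0 (least similar) to 4 (most similar).
SELECT a.Customer,
       b.ID,
       b.Customer AS ExistingCustomer,
       SOUNDEX(a.Customer) AS PotentialCode,
       SOUNDEX(b.Customer) AS ExistingCode
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b
            ON DIFFERENCE(a.Customer, b.Customer) = 4;
```

Since the code is only four characters long, unrelated names can share a code, which is exactly the kind of edge case the paragraph above warns about.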

If you do manage to find something that works with enough accuracy for you, you will then hit performance problems.
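One cheap way to reduce the number of calls to the distance function relies on a property of Levenshtein distance: it can never be smaller than the difference between the two string lengths. The sketch below assumes the same temp tables and threshold of 5 as the earlier query and returns the same rows as that query. Whether it helps in practice depends on the data, and it has not been measured here:

```sql
SELECT a.Customer,
       b.ID,
       b.Customer AS ExistingCustomer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b
            ON CASE
                   -- Lengths differ by 5 or more, so the distance cannot be below 5:
                   -- return a value that fails the test without calling the function.
                   WHEN ABS(LEN(REPLACE(a.Customer, ' ', '')) - LEN(REPLACE(b.Customer, ' ', ''))) >= 5
                       THEN 5
                   ELSE dbo.ufn_levenshtein(REPLACE(a.Customer, ' ', ''), REPLACE(b.Customer, ' ', ''))
               END < 5;
```

The CASE expression is used instead of a plain AND because SQL Server does not promise to evaluate AND-ed predicates in written order.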

In short, your best options are to go down the SQLCLR route and learn a lot of C# on the way, or to not bother with fuzzy matching at all and instead clean your data at the source, or to create a lookup table of 'clean' names, which will require constant maintenance as new variants come in.
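The lookup-table option could look roughly like this. The #CustomerAliases table and its columns are hypothetical names made up for illustration, and the rows are taken from the sample data used earlier on this page:

```sql
-- Hypothetical alias table: one row per known spelling variant.
CREATE TABLE #CustomerAliases
(Alias      VARCHAR(255) NOT NULL PRIMARY KEY,
 CustomerID INT          NOT NULL   -- refers to #ExistingCustomers.ID
);

INSERT INTO #CustomerAliases (Alias, CustomerID)
VALUES ('Eds Barbershop', 1002),
       ('Grocery Town',   1003),
       ('Handee Man',     1005);

-- Match on a known alias first, otherwise fall back to an exact name match.
SELECT a.Customer,
       b.ID,
       b.Customer AS ExistingCustomer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #CustomerAliases x ON x.Alias = a.Customer
     LEFT JOIN #ExistingCustomers b ON b.ID = x.CustomerID
                                    OR b.Customer = a.Customer;
```

Every new variant has to be added to the alias table by hand, which is the maintenance cost mentioned above.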

One way is to apply the REPLACE function to both sides of the comparison, so that spaces, hyphens and apostrophes are ignored:

SELECT a.Customer, b.ID
FROM PotentialCustomers a
  LEFT JOIN ExistingCustomers b
     ON (LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(a.Customer,' ',''),'-',''),'''',''))) = LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(b.Customer,' ',''),'-',''),'''',''))))
        OR (a.Customer LIKE '%'+b.Customer+'%')
        OR (b.Customer LIKE '%'+a.Customer+'%')

You need more than one field to do this with any effectiveness. Do you have things like city, state, zip, address, etc.? You can then create a multipart key from those fields concatenated together. You may want to truncate some of them, say to the first 5 characters, but the looser you make the key, the more false positives you get.

I’ve done this by creating a couple of keys, each less restrictive than the last, then trying each key in turn and assigning a match grade according to which key produced the match.
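The graded-key idea could be sketched as follows. Everything here is an assumption made for illustration: the Zip column, the sample rows, the three key definitions and the grade letters are not from the original answers.

```sql
-- Hypothetical tables: the Zip column is not part of the earlier example.
DECLARE @Existing  TABLE (ID INT, Customer VARCHAR(255), Zip CHAR(5));
DECLARE @Potential TABLE (Customer VARCHAR(255), Zip CHAR(5));

INSERT INTO @Existing  VALUES (1005, 'Handy Man', '12345');
INSERT INTO @Potential VALUES ('Handee Man', '12345'),
                              ('Beauty Salon', '12345');

SELECT p.Customer,
       e.ID,
       e.Customer AS ExistingCustomer,
       CASE
           WHEN p.Customer = e.Customer                   THEN 'A'  -- key 1: full name + zip
           WHEN LEFT(p.Customer, 5) = LEFT(e.Customer, 5) THEN 'B'  -- key 2: first 5 characters + zip
           WHEN e.ID IS NOT NULL                          THEN 'C'  -- key 3: first 3 characters + zip
       END AS MatchGrade
FROM @Potential p
     LEFT JOIN @Existing e
            ON  p.Zip = e.Zip                                  -- every key includes the zip
            AND LEFT(p.Customer, 3) = LEFT(e.Customer, 3);     -- loosest key decides which rows join
```

The join uses the loosest key, and the CASE expression then reports the strictest key each matched pair satisfies, so lower grades flag the matches most likely to be false positives.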
