Question
I basically have a two column table containing a primary key and names of companies with about 20,000 rows.
My task is to find all duplicate entries.
I originally tried using SOUNDEX, but it matched companies that were completely different just because they shared a similar first word. That led me to the Levenshtein distance algorithm.
The problem is that the query never finishes. I've left it running for about 10 hours now and it still hasn't returned.
Here is the query:
SELECT *
FROM `Companies` a, `Companies` b
WHERE levenshtein(a.name, b.name)<5
AND a.id<>b.id
And here is the levenshtein function I'm using (got it from this post):
DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT
  DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  -- max strlen=255
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    -- cv1 holds the previous row of the DP matrix, one byte per cost
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN
          SET cost = 0;
        ELSE
          SET cost = 1;
        END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END$$
DELIMITER ;
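For reference, the stored function above implements the classic two-row dynamic-programming edit distance (cv1 is the previous row, cv0 the current one). A minimal Python sketch of the same algorithm, not part of the original question:

```python
def levenshtein(s1: str, s2: str) -> int:
    """Two-row dynamic-programming Levenshtein distance."""
    if s1 == s2:
        return 0
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    prev = list(range(len(s2) + 1))  # like cv1: costs for the previous s1 prefix
    for i, ch1 in enumerate(s1, start=1):
        curr = [i]                   # like cv0: starts with the deletion cost
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```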
What can I do to speed up the query?
Answer 1:
I know at least one optimization that might cut the running time in half:
AND a.id < b.id
This prevents you from testing a=1, b=2 when you've already tested a=2, b=1.
It's still going to be O(n^2), though, and I can't see how you can do much about that.
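The effect of swapping `a.id <> b.id` for `a.id < b.id` can be illustrated in Python (a sketch with made-up ids): ordered pairs number n(n-1), unordered pairs only n(n-1)/2.

```python
from itertools import combinations, permutations

ids = [1, 2, 3, 4, 5]

# a.id <> b.id: every ordered pair is tested, so (1, 2) and (2, 1) both appear
ordered = list(permutations(ids, 2))

# a.id < b.id: each unordered pair is tested exactly once
unordered = list(combinations(ids, 2))

print(len(ordered), len(unordered))  # 20 10
```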
Answer 2:
How similar are the names of the duplicate entries?
If they are exact you could just group by the name:
SELECT REPLACE(name, ' ', '') as name, count(id) as totalDuplicates
FROM `Companies`
GROUP BY REPLACE(name, ' ', '')
Anything with a count greater than 1 is a duplicate.
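In Python terms, the grouping above is a single hash-based pass over the names; a sketch with made-up company names:

```python
from collections import defaultdict

companies = {
    1: "Acme Corp",
    2: "Acme  Corp",  # same name, extra space
    3: "Body FX",
}

# group ids by the whitespace-stripped name, like GROUP BY REPLACE(name, ' ', '')
groups = defaultdict(list)
for cid, name in companies.items():
    groups[name.replace(" ", "")].append(cid)

duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)  # {'AcmeCorp': [1, 2]}
```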
Answer 3:
So I implemented a bunch of suggestions in this thread to reduce my query time.
I indexed the name column, changed a.id <> b.id to a.id < b.id to avoid re-comparing already-compared rows, and added LEFT(a.name, 3) = LEFT(b.name, 3) to avoid running the heavy levenshtein function on rows that can be excluded cheaply by their first 3 characters.
This was the query I used:
SELECT *
FROM `Companies` a, `Companies` b
WHERE LEFT(a.name, 3) = LEFT(b.name, 3)
AND a.id < b.id
AND levenshtein(a.name, b.name)<3
This took about 2 hours to complete, and gave me 964 results. After that, I exported the results as a .csv and imported them into another table, TABLE 2. TABLE 2 is structured like this:
COL 1, COL 2, COL 3, COL 4
a.id, a.name, b.id, b.name
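The LEFT(name, 3) condition in the query above is a classic blocking step: candidate pairs are only generated within groups sharing a 3-character prefix, so the expensive distance function runs far fewer times. A Python sketch (the names are made up, and `levenshtein` stands in for any edit-distance implementation):

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(s1, s2):
    # two-row DP edit distance; any implementation works here
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(curr[j - 1] + 1, prev[j] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

names = {1: "Baxco", 2: "Baxyl", 3: "Body FX", 4: "Body Fit", 5: "Zenith"}

# block on the first three characters, then compare only within a block
blocks = defaultdict(list)
for cid, name in names.items():
    blocks[name[:3]].append(cid)

pairs = [(a, b)
         for ids in blocks.values()
         for a, b in combinations(sorted(ids), 2)
         if levenshtein(names[a], names[b]) < 3]
print(pairs)  # [(1, 2), (3, 4)]
```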
I noticed that a lot of the results in TABLE 2 were actually different companies that happened to be only a couple of characters apart, making the Levenshtein distance ineffective at separating them. For example: "Body FX" and "Body Fit", or "Baxco" and "Baxyl".
I attempted to filter out more of these by comparing the last 2 characters of each name with RIGHT(), but ran into problems because some names were plural, like "Aroostock Medical Center" and "Aroostock Medical Centers". So I wrote my own RIGHT_PLURAL() function that ignores the plural suffix.
DROP FUNCTION IF EXISTS RIGHT_PLURAL;
DELIMITER $$
CREATE FUNCTION RIGHT_PLURAL(input VARCHAR(50), right_input INT)
  RETURNS VARCHAR(50)
BEGIN
  DECLARE length INT;
  SET length = LENGTH(input);
  IF RIGHT(input, 2) = "'s" THEN
    RETURN SUBSTR(input, length - right_input - 1, right_input);
  ELSEIF RIGHT(input, 1) = "s" THEN
    RETURN SUBSTR(input, length - right_input, right_input);
  ELSE
    RETURN RIGHT(input, right_input);
  END IF;
END;
$$
DELIMITER ;
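The logic of RIGHT_PLURAL is easier to see in Python: strip a trailing "'s" or "s", then take the last right_input characters. A minimal sketch of the same behavior:

```python
def right_plural(text: str, n: int) -> str:
    """Last n characters of text, ignoring a trailing "'s" or "s"."""
    if text.endswith("'s"):
        text = text[:-2]
    elif text.endswith("s"):
        text = text[:-1]
    return text[-n:]

print(right_plural("Aroostock Medical Centers", 2))  # er
print(right_plural("Aroostock Medical Center", 2))   # er
```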
I ran
SELECT *
FROM `TABLE 2`
WHERE RIGHT_PLURAL(`COL 2`, 2) = RIGHT_PLURAL(`COL 4`, 2)
and was down to 893 duplicates. I was satisfied. I copied the result set over to TABLE 3 and ran the following:
DELETE
FROM `Companies`
WHERE `id` IN ( SELECT `COL 1` FROM `TABLE 3` )
My database was now largely duplicate-free! The only strays left were due to serious misspellings of names.
Answer 4:
Google released a tool for cleaning up messy data called 'Refine' (http://code.google.com/p/google-refine/). Maybe you could try it and see whether it helps in this case.
Source: https://stackoverflow.com/questions/24107101/mysql-using-levenshtein-distance-to-find-duplicates-in-20-000-rows