Fuzzy matching a string in SQL

Submitted by 大憨熊 on 2020-04-11 04:19:28

Question


I have a User table that has id, first_name, last_name, street_address, city, state, zip-code, firm, user_identifier, created_at, and update_at.

This table has a lot of duplication: the same user has been entered multiple times as a new user. For example:


id  first_name  last_name  street_address  user_identifier
-----------------------------------------------------------
11  Mary        Doe        123 Main Ave    M2111111
21  Mary        Doe        123 Main Ave    M2344455
13  Mary Esq    Doe        123 Main Ave    M1233444

I would like to know if there is a way of doing fuzzy matching on this table.

Basically, I would like to find all the users that have the same name and the same address, allowing for slight differences: maybe the address is the same but has a different apartment number, or one record has a middle name while the other duplicates don't.

I was thinking of creating a new column that concatenates first_name, last_name, and street_address, and doing a fuzzy match on that column.
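Something like this generated column is what I had in mind (a sketch, assuming PostgreSQL 12+ on the Postgres side; match_key is just a placeholder name, and coalesce() keeps a NULL column from nulling the whole value):

-- A hypothetical match_key column to fuzzy-match against; sketch only.
ALTER TABLE users
  ADD COLUMN match_key text
  GENERATED ALWAYS AS (
      coalesce(first_name, '') || ' ' || coalesce(last_name, '')
      || ' ' || coalesce(street_address, '')
  ) STORED;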

I tried a Levenshtein distance on first_name and last_name concatenated as full_name, but it doesn't seem to catch the record that has a middle name:

select * from users
where levenshtein('Mary Doe', full_name) <=1;

I am using Databricks and PostgreSQL.

Thank you!


Answer 1:


In Postgres you can use the fuzzystrmatch package. It provides a levenshtein function that returns the distance between two texts; you can then perform fuzzy matching with the following example predicate:

where levenshtein(street_address, '123 Main Avex') <= 1

This will match all the records above, because the distance between '123 Main Ave' and '123 Main Avex' is 1 (one insertion).

Of course, the value 1 here is just an example, and it makes the matching quite strict (a difference of only one character). You should either use a larger number or, as @IVO GELOV suggests, use a relative distance (the distance divided by the length).
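For example, a relative-distance version of the predicate above might look like this (a sketch; the 0.2 threshold is illustrative and needs tuning against your data):

-- Distance divided by the length of the longer string,
-- so the tolerance scales with the address length.
SELECT *
FROM users
WHERE levenshtein(street_address, '123 Main Avex')::float
      / greatest(length(street_address), length('123 Main Avex')) <= 0.2;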




Answer 2:


If you get to the point where Levenshtein ("edit distance") isn't capturing all of the matches you need, I'd strongly encourage you to check out pg_trgm. It. Is. Awesome.

https://www.postgresql.org/docs/current/pgtrgm.html

As an example of why to use trigrams: they let you pick up cases where first_name and last_name are reversed, a relatively common error. Levenshtein isn't well suited to spotting that, since all it does is transform one string into the other and count the number of edits required; swapped elements increase the distance quite a bit and make the match less likely.

As an example, pretend that you have a record where the correct full name is "David Adams". It's pretty common to find the last name entered as "Adam", and to find first and last names reversed, so that's three plausible forms for a simple name. How does Levenshtein perform compared with the Postgres trigram implementation? For this, I compared levenshtein(string 1, string 2) with similarity(string 1, string 2). As noted above, Levenshtein is a count where a higher score means less similar. To normalize the scores to a 0-1 scale where 1 = identical, I divided the distance by the max full name length, as suggested above, and subtracted the result from 1. That last bit makes the figures directly comparable to a similarity() score. (Otherwise, you've got two scales where 1 means opposite things.)

Here are some simple results, rounded a bit for clarity:

Row 1           Row 2          Levenshtein()   Levenshtein %   Similarity %
David Adams     Adam David                10               9             77
Adam David      Adams David                1              91             77
Adams David     David Adams               10               9            100

As you can see, the similarity() score performs better in a lot of cases, even in this simple example. Then again, Levenshtein does better in one case. It's not rare to combine techniques; if you do that, normalize the scales to save yourself some headaches.
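If you want to reproduce these figures yourself, a minimal sketch looks like this (assumes both the fuzzystrmatch and pg_trgm extensions are installed):

-- Normalized Levenshtein (1 = identical) next to trigram similarity
-- for one of the name pairs from the table above.
SELECT 1 - levenshtein('Adam David', 'Adams David')::float
           / greatest(length('Adam David'), length('Adams David')) AS levenshtein_pct,
       similarity('Adam David', 'Adams David')                     AS similarity_pct;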

But all of this is made a lot easier if you've got cleaner data to start with. If one of your problems is with inconsistent abbreviations and punctuation, Levenshtein can be a poor match. For this reason, it's helpful to perform address standardization before duplicate matching.
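As a minimal sketch of what that standardization might look like (the replacement list is illustrative, not a complete standard):

-- Lowercase and collapse a couple of common street-type variants before matching.
-- Extend the replacement list to suit your data; this is nowhere near exhaustive.
SELECT id,
       regexp_replace(
           regexp_replace(lower(street_address), '\mavenue\M', 'ave', 'g'),
           '\mstreet\M', 'st', 'g') AS street_address_norm
FROM users;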

For what it's worth (a lot), trigrams in Postgres can use indexes. It's a good bet to find a technique that safely reduces the candidate set with an indexed search before performing a more expensive comparison like Levenshtein. And a trick for Levenshtein: if you have a target tolerance and have the lengths of your strings stored, you can exclude strings that are too short or too long based on that stored length without running the more expensive fuzzy comparison. If, for example, you have a starting string of length 10 and only want strings that are at most 2 edits away, you're wasting your time testing strings that are only 7 characters long, etc.
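Putting those two ideas together, a sketch might look like this (the index name and the tolerance of 2 are illustrative, and the % operator uses pg_trgm's similarity_threshold setting):

-- Indexed trigram pre-filter, then a cheap length screen,
-- then the exact (and expensive) Levenshtein check last.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE INDEX IF NOT EXISTS users_street_trgm_idx
    ON users USING gin (street_address gin_trgm_ops);

SELECT *
FROM users
WHERE street_address % '123 Main Ave'                            -- indexed candidate filter
  AND abs(length(street_address) - length('123 Main Ave')) <= 2  -- cheap length screen
  AND levenshtein(street_address, '123 Main Ave') <= 2;          -- exact check last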

Note that the bad data input problem you describe often comes down to

  • poor user training and/or
  • poor UX

It's worth reviewing how bad data is getting in, once you've got your cleanup in good order. If you have a finite set of trainable users, it can help to run a nightly (etc.) scan to detect new likely duplicates, and then go and talk to whoever is generating them. Maybe there's something they don't know you can tell them, maybe there's a problem in the UI that you don't know that they can tell you.
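Such a scan could be as simple as a self-join on trigram similarity (a sketch; the 0.5 threshold and the name concatenation are illustrative, and NULL handling is left out):

-- Pair up rows that look like the same person: trigram-similar addresses
-- plus similar full names. The a.id < b.id condition avoids self-pairs
-- and reporting each pair twice.
SELECT a.id, b.id,
       similarity(a.first_name || ' ' || a.last_name,
                  b.first_name || ' ' || b.last_name) AS name_sim
FROM users a
JOIN users b ON a.id < b.id
WHERE a.street_address % b.street_address
  AND similarity(a.first_name || ' ' || a.last_name,
                 b.first_name || ' ' || b.last_name) > 0.5
ORDER BY name_sim DESC;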




Answer 3:


There is a LIKE operator. Have you considered trying that?

The following SQL statement selects all customers with a CustomerName starting with "a":

SELECT * FROM Customers
WHERE CustomerName LIKE 'a%';

https://www.w3schools.com/sql/sql_like.asp
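Applied to the users table from the question, that would be something like the sketch below. Note that LIKE only catches prefix/pattern matches, not true fuzzy matches, so it would find 'Mary' and 'Mary Esq' but nothing with a typo early in the name.

SELECT * FROM users
WHERE first_name LIKE 'Mary%';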



Source: https://stackoverflow.com/questions/58840005/fuzzy-matching-a-string-in-sql
