I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix
"I am trying to find a reliable method for matching duplicate person records within the database."
Alas there is no such thing. The most you can hope for is a system with a reasonable element of doubt.
SQL> select n1
, n2
, soundex(n1) as sdx_n1
, soundex(n2) as sdx_n2
, utl_match.edit_distance_similarity(n1, n2) as ed
, utl_match.jaro_winkler_similarity(n1, n2) as jw
from t94
order by n1, n2
/
2 3 4 5 6 7 8 9
N1 N2 SDX_ SDX_ ED JW
-------------------- -------------------- ---- ---- ---------- ----------
MARK MARKIE M620 M620 67 93
MARK MARKS M620 M620 80 96
MARK MARKUS M620 M622 67 93
MARKY MARKIE M620 M620 67 89
MARSK MARKS M620 M620 60 95
MARX AMRX M620 A562 50 91
MARX M4RX M620 M620 75 85
MARX MARKS M620 M620 60 84
MARX MARSK M620 M620 60 84
MARX MAX M620 M200 75 93
MARX MRX M620 M620 75 92
11 rows selected.
SQL> SQL> SQL>
The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed: this is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implemenations of them via Google.
The advantage of scoring is that they allow for a degree of fuzziness; so you can find all rows where name_score >= 90%
. The crushing disadvantage is that the scores are relative and so you cannot index them. This sort of comparison kills you with large volumes.
What this means is:
In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.
Whatever you do you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty good with MARX vs AMRX).