Matching Oracle duplicate column values using Soundex, Jaro Winkler and Edit Distance (UTL_MATCH)

半阙折子戏 2021-01-01 06:20

I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious quality issues, which I am also trying to fix.

1 Answer
  • 2021-01-01 06:40

    "I am trying to find a reliable method for matching duplicate person records within the database."

    Alas there is no such thing. The most you can hope for is a system with a reasonable element of doubt.

    SQL> select n1
              , n2
              , soundex(n1) as sdx_n1
              , soundex(n2) as sdx_n2
              , utl_match.edit_distance_similarity(n1, n2) as ed
              , utl_match.jaro_winkler_similarity(n1, n2) as jw
         from t94
         order by n1, n2
         /
    
    
    N1                   N2                   SDX_ SDX_         ED         JW
    -------------------- -------------------- ---- ---- ---------- ----------
    MARK                 MARKIE               M620 M620         67         93
    MARK                 MARKS                M620 M620         80         96
    MARK                 MARKUS               M620 M622         67         93
    MARKY                MARKIE               M620 M620         67         89
    MARSK                MARKS                M620 M620         60         95
    MARX                 AMRX                 M620 A562         50         91
    MARX                 M4RX                 M620 M620         75         85
    MARX                 MARKS                M620 M620         60         84
    MARX                 MARSK                M620 M620         60         84
    MARX                 MAX                  M620 M200         75         93
    MARX                 MRX                  M620 M620         75         92
    
    11 rows selected.
    
    

    The big advantage of SOUNDEX is that it tokenizes the string. This gives you something which can be indexed, and that is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implementations of them via Google.
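
    As a rough sketch of why the token matters: because SOUNDEX depends on one string only, you can put a function-based index on it and let the database find candidate pairs. The PERSONS table, its columns and the index name are hypothetical and only there for illustration.

    ```sql
    -- Hypothetical table: persons(person_id, first_name, last_name, date_of_birth)
    -- Function-based index on the SOUNDEX token of the last name.
    create index persons_ln_sdx_i on persons (soundex(last_name));

    -- Candidate duplicates: distinct rows whose last names share a token.
    -- The join predicate on soundex(last_name) is eligible to use the index.
    select a.person_id as id_a
         , b.person_id as id_b
         , a.last_name as name_a
         , b.last_name as name_b
    from   persons a
           join persons b
             on  soundex(b.last_name) = soundex(a.last_name)
             and b.person_id > a.person_id   -- report each pair once
    /
    ```

    Going by the output above, this would pair MARX with MRX and M4RX (all M620) but would miss MARX vs AMRX (M620 vs A562), which is the crude part.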

    The advantage of the scoring functions (edit distance, Jaro Winkler) is that they allow for a degree of fuzziness, so you can find all rows where name_score >= 90%. The crushing disadvantage is that the scores are relative: a score only exists for a particular pair of strings, so you cannot index it. This sort of comparison kills you with large volumes.
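
    For illustration, a threshold query against the same T94 table might look like this. The cut-off of 90 is just the figure mentioned above, not a recommended value.

    ```sql
    -- The score depends on both n1 and n2, so there is no single-column
    -- expression to index: every pair has to be scored when the query runs.
    select n1
         , n2
         , utl_match.jaro_winkler_similarity(n1, n2) as jw
    from   t94
    where  utl_match.jaro_winkler_similarity(n1, n2) >= 90
    order  by jw desc, n1, n2
    /
    ```

    Going by the scores listed above, seven of the eleven pairs would pass, including MARX vs AMRX (91) and MARX vs MAX (93), while MARX vs M4RX (85) would not.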

    What this means is:

    1. You need a mix of strategies. No single algorithm will solve your problem.
    2. Data cleansing is useful. Compare the scores for MARX vs MRX and M4RX: stripping numbers out of names improves the hit rate.
    3. You cannot score big volumes of names on the fly. Use tokenizing and pre-scoring if you can. Use caching if you don't have a lot of churn. Use partitioning if you can afford it.
    4. Use Oracle Text (or similar) to build a thesaurus of nicknames and variants.
    5. Oracle 11g introduced specific name search functionality to Oracle Text.
    6. Build a table of canonical names for scoring and link actual data records to that.
    7. Use other data values, especially indexable ones like date of birth, to pre-filter large volumes of names or to increase confidence in proposed matches.
    8. Be aware that other data values come with their own problems: is someone born on 31/01/11 eleven months old or eighty years old?
    9. Remember that names are tricky, especially when you have to consider names which have been romanized: there are over four hundred different ways of spelling Moammar Khadaffi (in the Roman alphabet), and not even Google can agree on which variant is the most canonical.
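
    Points 2 and 7 can be sketched roughly as follows. The first query uses the T94 table from above; the second assumes a hypothetical PERSONS table with PERSON_ID, LAST_NAME and DATE_OF_BIRTH columns, and the threshold of 90 is arbitrary.

    ```sql
    -- Cleansing: strip digits before scoring, so M4RX is scored as MRX.
    select n1
         , n2
         , utl_match.jaro_winkler_similarity(n1, n2) as jw_raw
         , utl_match.jaro_winkler_similarity(
               regexp_replace(n1, '[[:digit:]]')
             , regexp_replace(n2, '[[:digit:]]')) as jw_clean
    from   t94
    where  regexp_like(n2, '[[:digit:]]')
    /

    -- Pre-filtering: only score pairs which already agree on an indexable
    -- value, here the date of birth. PERSONS is a hypothetical table.
    select a.person_id as id_a
         , b.person_id as id_b
         , utl_match.jaro_winkler_similarity(a.last_name, b.last_name) as jw
    from   persons a
           join persons b
             on  b.date_of_birth = a.date_of_birth
             and b.person_id     > a.person_id   -- report each pair once
    where  utl_match.jaro_winkler_similarity(a.last_name, b.last_name) >= 90
    /
    ```

    With the digit removed, M4RX becomes MRX, which in the output above scores 92 against MARX rather than 85. The pre-filter inherits the problem in point 8: it will silently drop pairs where one of the dates is wrong.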

    In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.
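
    To make that comparison concrete, here is a small sketch which scores the concatenated strings alongside the token-by-token combinations. Taking the LEAST of the two token scores is just one possible way of combining them, chosen for illustration.

    ```sql
    with pair as (
      select 'GRAHAM' as fn_a, 'OLIVER' as ln_a
           , 'OLIVER' as fn_b, 'GRAHAM' as ln_b
      from   dual
    )
    select utl_match.jaro_winkler_similarity(
               fn_a || ' ' || ln_a
             , fn_b || ' ' || ln_b) as concatenated
           -- first name vs first name, last name vs last name
         , least(utl_match.jaro_winkler_similarity(fn_a, fn_b)
               , utl_match.jaro_winkler_similarity(ln_a, ln_b)) as same_order
           -- first name vs last name, last name vs first name
         , least(utl_match.jaro_winkler_similarity(fn_a, ln_b)
               , utl_match.jaro_winkler_similarity(ln_a, fn_b)) as swapped
    from   pair
    /
    ```

    The SWAPPED column compares each token with an identical token, so it picks up the transposed name as an exact match, which neither of the other two columns can do.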

    Whatever you do, you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty well with MARX vs AMRX).
