Techniques for finding near duplicate records

后端 未结 4 433
太阳男子
太阳男子 2020-11-29 17:35

I\'m attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there

4条回答
  •  有刺的猬
    2020-11-29 18:21

    What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.

    These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.

    The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.

提交回复
热议问题