How to detect duplicate data?

后端 未结 11 1584
半阙折子戏
半阙折子戏 2021-02-01 08:24

I have got a simple contacts database but I\'m having problems with users entering in duplicate data. I have implemented a simple data comparison but unfortunately the duplicate

11条回答
  •  醉梦人生
    2021-02-01 08:36

    I'm not sure it will work well for the names vs nicknames problem, but the most common algorithm in this sort of area would be the edit distance / Levenshtein distance algorithm. It's basically a count of the number of character changes, additions and removals required to turn one item into another.

    For names, I'm not sure you're ever going to get good results with a purely algorithmic approach - What you really need is masses of data. Take, for example, how much better Google spelling suggestions are than those in a normal desktop application. This is because Google can process billions of web queries and look at what queries lead to each other, what 'did you mean' links actually get clicked etc.

    There are a few companies which specialise in the name matching problem (mostly for national security and fraud applications). The one I could remember, Search Software America seems to have been bought out by these guys http://www.informatica.com/products_services/identity_resolution/Pages/index.aspx, but I suspect any of these sorts of solutions would be far to expensive for a contacts application.

提交回复
热议问题