How to detect duplicate data?

I have got a simple contacts database but I'm having problems with users entering in duplicate data. I have implemented a simple data comparison but unfortunately the duplicate

11 Answers

一向

2021-02-01 08:44

So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

Algorithms such as Soundex and edit distance (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, they will not be enough. As others have stated, "Bill" does not sound anything like "William".
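To make the "percentage" idea concrete, here is a minimal sketch of an edit-distance (Levenshtein) similarity score. The function names and the choice to normalise by the longer string are my own assumptions, not something from the original post.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Similarity from 0 to 100, ignoring case and surrounding whitespace.
    Normalising by the longer string is an assumption of this sketch."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    for left, right in [("Smith", "Smyth"), ("Bill", "William")]:
        print(left, right, round(similarity_percent(left, right), 1))
```

A typo such as "Smith" / "Smyth" scores high, while "Bill" / "William" scores under 50%, which is exactly the nickname problem described above.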

The best solution I have found is to use a reduction algorithm and a mapping table to reduce each name to its root name.

To your regular address table, add root versions of the names, e.g. Person (Firstname, RootFirstName, Surname, RootSurname, ...)

Now, create a mapping table: FirstNameMappings (PRIMARY KEY Firstname, Rootname)

Populate the mapping table with: INSERT IGNORE INTO FirstNameMappings (Firstname, Rootname) SELECT DISTINCT Firstname, 'UNDEFINED' FROM Person

This will add every first name in your Person table, each with a Rootname of "UNDEFINED".

Now, sadly, you will have to go through all the unique first names and map each one to a root name. For example, "Bill", "Billl" and "Will" should all be translated to "William". This is very time consuming, but if data quality really is important to you, I think it's one of the best ways.

Now use the newly created mapping table to update the "Rootfirstname" field in your Person table. Repeat for surname and address. Once this is done you should be able to detect duplicates without suffering from spelling errors.
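The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary stands in for the FirstNameMappings table, its contents are hand-written examples, and only the first name is reduced to a root (surnames are just lower-cased), whereas the answer suggests repeating the process for surname and address.

```python
from collections import defaultdict

# Hypothetical stand-in for the FirstNameMappings table (Firstname -> Rootname).
# In practice this is the hand-curated part of the process.
FIRST_NAME_ROOTS = {
    "bill": "william",
    "billl": "william",
    "will": "william",
    "william": "william",
}


def root_first_name(first_name, mappings=FIRST_NAME_ROOTS):
    """Look up the root name; unmapped names stay 'UNDEFINED' as in the text."""
    return mappings.get(first_name.strip().lower(), "UNDEFINED")


def find_duplicates(people):
    """Group (first_name, surname) pairs by (root first name, surname).

    Rows whose first name is still UNDEFINED are skipped, because they
    have not been reviewed yet and would otherwise all collapse together.
    """
    groups = defaultdict(list)
    for first_name, surname in people:
        root = root_first_name(first_name)
        if root == "UNDEFINED":
            continue
        groups[(root, surname.strip().lower())].append((first_name, surname))
    return [rows for rows in groups.values() if len(rows) > 1]


if __name__ == "__main__":
    people = [("Bill", "Gates"), ("William", "Gates"),
              ("Billl", "gates"), ("Steve", "Jobs")]
    for group in find_duplicates(people):
        print(group)
```

In a real database the grouping would more likely be a GROUP BY on the root columns; the Python version is only meant to show the idea.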

暗喜

2021-02-01 08:46
I imagine that this problem is well understood but what occurs to me on first reading is:
- compare fields individually
- count those that match (for a possibly loose definition of match, and possibly weighting the fields differently)
- present for human intervention any cases which pass some threshold
Use your existing database to get a good first guess for the threshold, and correct as you accumulate experience.

You may prefer a fairly strong bias toward false positives, at least at first.
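A rough sketch of that approach follows. The field names, weights, and the 0.5 threshold are all invented for illustration; the "loose" match here is simply case- and whitespace-insensitive equality, and a real system might substitute something fuzzier.

```python
from itertools import combinations

# Hypothetical field weights: fields that identify a person more strongly count more.
FIELD_WEIGHTS = {"first_name": 1.0, "surname": 2.0, "email": 3.0, "phone": 2.0}


def loose_match(a, b):
    """Equal after lower-casing and removing all whitespace; empty never matches."""
    a = "".join(a.lower().split())
    b = "".join(b.lower().split())
    return bool(a) and a == b


def match_score(rec_a, rec_b, weights=FIELD_WEIGHTS):
    """Weighted fraction of fields that match, between 0.0 and 1.0."""
    total = sum(weights.values())
    matched = sum(
        weight
        for field, weight in weights.items()
        if loose_match(rec_a.get(field, ""), rec_b.get(field, ""))
    )
    return matched / total


def candidates_for_review(records, threshold=0.5):
    """Return (i, j, score) for every pair at or above the threshold.

    A low threshold deliberately produces false positives for a human to reject.
    """
    results = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = match_score(a, b)
        if score >= threshold:
            results.append((i, j, score))
    return results
```

Comparing every pair is quadratic in the number of records, so for a large table you would probably restrict comparisons to records that already share something, such as a surname or postcode.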
不要未来只要你来

2021-02-01 08:46

If you have access to SSIS, check out the Fuzzy Grouping and Fuzzy Lookup transformations.

http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services

http://msdn.microsoft.com/en-us/library/ms137786.aspx

我在风中等你

2021-02-01 08:46

If you have a large database with string fields, you can very quickly find a lot of duplicates by using the simhash algorithm.

遥遥无期

2021-02-01 08:54

You might also want to look into probabilistic matching.
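One common form of probabilistic matching is the Fellegi-Sunter style of record linkage, where each field adds or subtracts evidence depending on whether it agrees. The sketch below is only an illustration of that idea: the field names and the m/u probabilities are made up, and in practice they would be estimated from your own data.

```python
import math

# Hypothetical per-field probabilities (assumptions for illustration only):
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
FIELD_PROBS = {
    "first_name": (0.90, 0.05),
    "surname": (0.95, 0.01),
    "postcode": (0.85, 0.02),
}


def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum of log2 likelihood ratios over all fields.

    Agreement on a field adds log2(m / u); disagreement adds
    log2((1 - m) / (1 - u)), which is negative. Higher totals mean
    the pair is more likely to be the same person.
    """
    total = 0.0
    for field, (m, u) in probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total
```

Pairs above an upper cut-off would be treated as matches, pairs below a lower cut-off as non-matches, and those in between sent for manual review.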
