Question
I am going to work on a project that involves entity deduplication. One or more datasets may contain duplicate entities. In real-world data, an entity's name, address, country, email, or social media ID may appear in different forms. My goal is to flag likely duplicates, giving a different weight to each of these attributes. I am looking for an open-source library, preferably written in Java.
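To illustrate what I mean by weighting, here is a minimal sketch. The class name, field names, weights, and the choice of normalized edit distance as the similarity measure are all assumptions made up for illustration; none of them come from the libraries listed below.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: weighted per-field similarity between two records. */
final class WeightedMatcher {
    // Illustrative weights only, not tuned values.
    static final Map<String, Double> WEIGHTS =
            Map.of("name", 0.3, "email", 0.5, "country", 0.2);

    // Trim and lowercase so trivial formatting differences do not count.
    static String normalize(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
    }

    // Classic two-row dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; a missing or empty value is treated as no evidence (0).
    static double fieldSimilarity(String a, String b) {
        String x = normalize(a), y = normalize(b);
        if (x.isEmpty() || y.isEmpty()) return 0.0;
        int maxLen = Math.max(x.length(), y.length());
        return 1.0 - (double) levenshtein(x, y) / maxLen;
    }

    // Weighted average of the per-field similarities.
    static double score(Map<String, String> r1, Map<String, String> r2) {
        double total = 0.0, weightSum = 0.0;
        for (Map.Entry<String, Double> e : WEIGHTS.entrySet()) {
            total += e.getValue() * fieldSimilarity(r1.get(e.getKey()), r2.get(e.getKey()));
            weightSum += e.getValue();
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }
}
```

Two records would then be flagged as possible duplicates when `WeightedMatcher.score(r1, r2)` exceeds some threshold; picking that threshold and the weights is exactly the part I would like a library to help with.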
Since I need to process millions of records, scalability and performance are a concern. In particular, the number of comparisons should not be on the order of n^2. Among the options I found below, some use index-based search with Lucene and others use data grouping.
Which of these approaches would you suggest?
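For reference, this is roughly how I understand the data grouping idea (often called blocking): records are bucketed by a cheap key, and only records in the same bucket are compared, so the all-pairs n^2 comparison is avoided. The sketch below is my own illustration under assumptions; `BlockingSketch`, `candidatePairs`, and the two-letter `prefixKey` are hypothetical names and choices, not the API of any library in my list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: generate candidate pairs only within blocks. */
final class BlockingSketch {
    // Illustrative blocking key: first two characters of the normalized value.
    static String prefixKey(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        return v.length() <= 2 ? v : v.substring(0, 2);
    }

    // Returns index pairs (i, j) with i < j for records sharing a blocking key.
    static <T> List<int[]> candidatePairs(List<T> records, Function<T, String> blockingKey) {
        // LinkedHashMap keeps blocks in first-seen order, so the output is deterministic.
        Map<String, List<Integer>> blocks = new LinkedHashMap<>();
        for (int i = 0; i < records.size(); i++) {
            String key = blockingKey.apply(records.get(i));
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : blocks.values()) {
            // Pairwise comparison is quadratic only in the size of each block.
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    pairs.add(new int[] {block.get(i), block.get(j)});
                }
            }
        }
        return pairs;
    }
}
```

The obvious trade-off is that duplicates whose keys differ (for example a typo in the first two letters) never land in the same block and are never compared, so the choice of key matters a lot.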
Here are my findings so far:
Duke (Java/Lucene)
Comments: Uses genetic algorithms and is flexible. There have been no updates since 2016.
YannBrrd/elasticsearch-entity-resolution (extension of Duke)
Comments: There have been no updates since 2017. I also need to check whether it is compatible with the latest Elasticsearch and Lucene versions.
dedupeio/dedupe (Python)
Comments: Uses a data grouping method, but it is written in Python.
JedAIToolkit (Java)
Comments: Uses a data grouping method.
Zentity (Elasticsearch Plugin)
Comments: It looks like a good one, but I need to check whether it supports deduplication. So far, its documentation talks about entity identity resolution.
Python Record Linkage Toolkit Documentation
Comments: It is in Python.
bakdata/dedupe (Java)
Comments: No clear documentation on how to use it.
I was wondering whether anybody knows of any others. I would also appreciate the pros and cons of the options above.
Source: https://stackoverflow.com/questions/57816223/looking-for-libraries-which-support-deduplication-on-entity