Looking for libraries that support entity deduplication

Submitted by 主宰稳场 on 2021-02-07 23:01:44

Question


I am going to work on some projects that deal with entity deduplication: one or more datasets that may contain duplicate entities. In the real world, an entity's name, address, country, email, or social media ID may appear in different forms. My goal is to flag records as possible duplicates, giving a different weight to each piece of entity information. I am looking for a library that is open source and preferably written in Java.
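To make the "different weight per field" idea concrete, here is a minimal sketch in plain Java. It is not taken from any of the libraries listed below; the class name, the field names, the weights, and the choice of normalized Levenshtein similarity are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class WeightedMatchSketch {

    // Hypothetical weights: how much each field contributes to the overall score.
    static final Map<String, Double> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("email", 0.4);
        WEIGHTS.put("name", 0.3);
        WEIGHTS.put("address", 0.2);
        WEIGHTS.put("country", 0.1);
    }

    // Classic dynamic-programming Levenshtein edit distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1], computed after trimming and lower-casing both values.
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int maxLen = Math.max(x.length(), y.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / maxLen;
    }

    // Weighted average over the fields present in both records; missing fields are skipped.
    static double score(Map<String, String> r1, Map<String, String> r2) {
        double total = 0.0;
        double weightSum = 0.0;
        for (Map.Entry<String, Double> e : WEIGHTS.entrySet()) {
            String v1 = r1.get(e.getKey());
            String v2 = r2.get(e.getKey());
            if (v1 == null || v2 == null) {
                continue;
            }
            total += e.getValue() * similarity(v1, v2);
            weightSum += e.getValue();
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("name", "John Smith", "email", "john.smith@example.com", "country", "US");
        Map<String, String> b = Map.of("name", "Jon Smith", "email", "John.Smith@example.com", "country", "US");
        // A pair would be flagged as a possible duplicate when the score exceeds a chosen threshold.
        System.out.println("score = " + score(a, b));
    }
}
```

The threshold above which a pair counts as a possible duplicate would have to be tuned on real data, and a per-field comparator (for example, exact match for email, a fuzzier one for address) would probably work better than a single similarity function.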

As I need to process millions of records, scalability and performance are a concern. In particular, the running time should not be on the order of n^2. Among the findings below, some use index-based search with Lucene and some use data grouping.
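As an illustration of the data grouping idea (often called blocking), the sketch below groups records by a cheap key and generates candidate pairs only within each group, so most of the n^2 comparisons are never made. It is an assumption-laden toy, not how any of the listed libraries implement it: the class name and the key (the first three alphanumeric characters of the lower-cased name) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class BlockingSketch {

    // Hypothetical blocking key: first three alphanumeric characters of the lower-cased name.
    static String blockingKey(String name) {
        String n = name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return n.length() <= 3 ? n : n.substring(0, 3);
    }

    // Returns index pairs (i, j) with i < j whose records share a blocking key.
    static List<int[]> candidatePairs(List<String> names) {
        // One pass over the data to build the blocks.
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int i = 0; i < names.size(); i++) {
            blocks.computeIfAbsent(blockingKey(names.get(i)), k -> new ArrayList<>()).add(i);
        }
        // Pairwise comparison is quadratic only within each block, not across the whole dataset.
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : blocks.values()) {
            for (int a = 0; a < block.size(); a++) {
                for (int b = a + 1; b < block.size(); b++) {
                    pairs.add(new int[] {block.get(a), block.get(b)});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = List.of("John Smith", "john smyth", "Jane Doe", "JOHN  SMITH", "Alice");
        for (int[] p : candidatePairs(names)) {
            System.out.println(names.get(p[0]) + " <-> " + names.get(p[1]));
        }
    }
}
```

The trade-off is recall: two duplicates whose keys differ (for example, a typo in the first letters) never get compared, so a single key like this one is usually too coarse on its own and several keys would likely be needed in practice.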

Which approach is better? Suggestions are welcome.

Here are my findings so far:

Duke (Java/Lucene)

Comments: Uses genetic algorithms and is flexible. There have not been any updates since 2016.

YannBrrd/elasticsearch-entity-resolution (extension of Duke)

Comments: There have not been any updates since 2017. I also need to check whether it is compatible with the latest ES and Lucene.

dedupeio/dedupe (Python)

Comments: Uses a data grouping method, but it is written in Python.

JedAIToolkit (Java)

Comments: Uses a data grouping method.

Zentity (Elasticsearch Plugin)

Comments: It looks good, but I need to check whether it supports deduplication. So far, the documentation only talks about entity resolution.

Python Record Linkage Toolkit Documentation

Comments: It is in Python.

bakdata/dedupe (Java)

Comments: The documentation on how to use it is not clear.

Does anybody know of any others? Please also share the pros and cons of the libraries above.

Source: https://stackoverflow.com/questions/57816223/looking-for-libraries-which-support-deduplication-on-entity
