record-linkage

R : Record Linkage problem with all fields combined in 1 column

久未见 posted on 2019-12-11 17:53:36
Question: I have to match column a from dataset A to column b in dataset B, but the different variables are not in separate fields (columns a, b, c); they are all in the same one. I have been looking at the packages RecordLinkage and fastLink, which work well when the fields are separated.

Separate fields:

    # make dataframe 1
    fname <- c("ash", "aalok", "aaron", "adam", "adrian", "ajay")
    lname <- c("perry", "phillips", "picardo", "pinck", "pinnick-flood", "pledger")
    dob <- c(1957, 1971, 1948, 1961, 1972, 2000)
    city <- c

Displaying corresponding values in data frame in R

一个人想着一个人 posted on 2019-12-11 12:27:44
Question: Please check the code below. I have created a data frame from three variables; the variable "y123" computes the similarity between column a2 and column a1. "y123" gives me 16 values in total, where every a1 value is compared with a2. What I need is this: when a particular "a1" value is compared with a particular "a2" value, the corresponding "a3" value next to "a2" should be displayed beside it. So the result should be a data frame with column y123 and a second column with corresponding

how to determine if a record in every source, represents the same person

流过昼夜 posted on 2019-12-06 10:29:28
Question: I have several sources of tables with personal data, like this:

SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...

SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...

SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...

So, assuming that the records with ID 1 from sources 1 and 2 are the same person, my problem is how to determine whether a record in every source represents the same person. Additionally, of course, not every record exists in

Fuzzy logic on big datasets using Python

烈酒焚心 posted on 2019-12-06 07:14:08
Question: My team has been stuck trying to run a fuzzy logic algorithm on two large datasets. The first (the subset) has about 180K rows and contains names, addresses, and emails for the people we need to match in the second (the superset). The superset contains 2.5M records. Both have the same structure, and the data has already been cleaned, i.e. addresses parsed, names normalized, etc.

ContactID int, FullName varchar(150), Address varchar(100), Email varchar(100)

The goal is to match values in a row of
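One common way to keep a match of this size tractable is blocking: index the superset by a cheap key and score each subset row only against candidates that share the key. The following is a minimal sketch of that idea, not the asker's method. The field names come from the schema above; the blocking key (first three letters of the name), the equal-weight average, and the 0.85 threshold are assumptions chosen for illustration, and `difflib` stands in for whatever similarity measure is actually used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["FullName"].strip().lower()[:3]


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_subset(subset, superset, threshold=0.85):
    # Build the index once so each subset row only sees its own block.
    index = defaultdict(list)
    for rec in superset:
        index[block_key(rec)].append(rec)

    matches = []
    for rec in subset:
        best, best_score = None, 0.0
        for cand in index.get(block_key(rec), []):
            # Assumed scoring: unweighted mean over the three text fields.
            score = (
                similarity(rec["FullName"], cand["FullName"])
                + similarity(rec["Address"], cand["Address"])
                + similarity(rec["Email"], cand["Email"])
            ) / 3
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            matches.append((rec["ContactID"], best["ContactID"], best_score))
    return matches
```

A blocking key trades recall for speed: two records whose names differ in the first three letters are never compared, so in practice several keys (for example name prefix and email) are often tried in separate passes.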

Is there an open source implementation for Fellegi-Sunter? [closed]

元气小坏坏 posted on 2019-12-06 05:32:37
Closed. This question is off-topic. It is not currently accepting answers. Closed 2 years ago.

Is there an open source implementation for Fellegi-Sunter?

Is this what you are looking for? Wikipedia: Record Linkage. Check under "Software implementations" for possible solutions.

Here are some open source implementations:
http://github.com/dedupeio/dedupe (author of this)
https://sourceforge.net/projects/febrl/
https://github.com/larsga/Duke

Source: https://stackoverflow.com/questions/5152217/is-there-a-open

how to determine if a record in every source, represents the same person

与世无争的帅哥 posted on 2019-12-04 13:18:13
I have several sources of tables with personal data, like this:

SOURCE 1
ID, FIRST_NAME, LAST_NAME, FIELD1, ...
1, jhon, gates ...

SOURCE 2
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
1, jon, gate ...

SOURCE 3
ID, FIRST_NAME, LAST_NAME, ANOTHER_FIELD1, ...
2, jhon, ballmer ...

So, assuming that the records with ID 1 from sources 1 and 2 are the same person, my problem is how to determine whether a record in every source represents the same person. Additionally, of course, not every record exists in all sources. The names are written mainly in Spanish. In this case, the exact matching needs to
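For near-identical spellings such as "jhon" / "jon", one possible starting point is to normalize each name and then compare field by field with a string-similarity ratio. The sketch below is only an illustration under stated assumptions: the helper names `fold` and `same_person` are hypothetical, the 0.8 threshold is arbitrary, accent folding is included because the names are said to be mostly Spanish, and `difflib` is used simply because it ships with Python.

```python
import unicodedata
from difflib import SequenceMatcher


def fold(text):
    # Strip accents (e.g. "José" -> "jose"), surrounding spaces, and case.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.strip().lower()


def field_similarity(a, b):
    return SequenceMatcher(None, fold(a), fold(b)).ratio()


def same_person(rec_a, rec_b, threshold=0.8):
    # Assumed rule: every compared field must clear the threshold.
    scores = [
        field_similarity(rec_a["FIRST_NAME"], rec_b["FIRST_NAME"]),
        field_similarity(rec_a["LAST_NAME"], rec_b["LAST_NAME"]),
    ]
    return min(scores) >= threshold
```

With these settings the sample rows for "jhon gates" and "jon gate" are treated as the same person, while "jhon gates" and "jhon ballmer" are not. Two name fields alone are weak evidence, so any additional shared fields would normally be added to the comparison.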

Duke Fast Deduplication: java.lang.UnsupportedOperationException: Operation not yet supported?

北慕城南 posted on 2019-12-02 06:27:55
I'm trying to use the Duke Fast Deduplication Engine to search for duplicate records in the database at the company where I work. I run it from the command line like this:

    java -cp "C:\utils\duke-0.6\duke-0.6.jar;C:\utils\duke-0.6\lucene-core-3.6.1.jar" no.priv.garshol.duke.Duke --showmatches --verbose .\config.xml

But I get an error:

    Exception in thread "main" java.lang.UnsupportedOperationException: Operation not yet supported
        at sun.jdbc.odbc.JdbcOdbcResultSet.isClosed(Unknown Source)
        at no.priv.garshol.duke.datasources.JDBCDataSource$JDBCIterator.close(JDBCDataSource.java:115)
        at no

Tools for matching name/address data [closed]

霸气de小男生 posted on 2019-11-28 17:19:20
Here's an interesting problem. I have an Oracle database with name and address information which needs to be kept current. We get data feeds from a number of different government sources, and need to figure out matches, and whether to update the database with the data or create a new record. There isn't any sort of unique identifier that can be used to tie records together, and the data quality isn't always that good: there will always be typos, people using different names (e.g. Joe vs. Joseph), etc. I'd be interested in hearing from anyone who's worked on this type of problem

Fuzzy matching deduplication in less than exponential time?

荒凉一梦 posted on 2019-11-28 04:34:48
I have a large database (potentially millions of records) with relatively short strings of text (street addresses, names, etc.). I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice.

My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once. The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential
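As a point of reference, comparing every record with every other record takes n(n-1)/2 comparisons, which is quadratic in the number of records. One well-known way to cut that down is the sorted-neighborhood approach: sort the records by a key and compare each one only with the few records that follow it in sorted order. The sketch below assumes the sort key is simply the lowercased string, a window of 3, and a 0.85 threshold; all three are illustrative choices, and `difflib` is a stand-in for any similarity measure.

```python
from difflib import SequenceMatcher


def sorted_neighborhood_pairs(strings, window=3, threshold=0.85):
    # Positions of the records, ordered by an assumed key (the lowercased text).
    order = sorted(range(len(strings)), key=lambda i: strings[i].lower())

    pairs = []
    for pos, i in enumerate(order):
        # Compare only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            score = SequenceMatcher(
                None, strings[i].lower(), strings[j].lower()
            ).ratio()
            if score >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs
```

The number of comparisons is roughly n times the window size, after an n log n sort. The cost is recall: duplicates that differ near the start of the string sort far apart and are missed, which is why several passes with different keys are often combined.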

Tools for matching name/address data [closed]

为君一笑 posted on 2019-11-27 10:36:40
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.

Here's an interesting problem. I have an Oracle database with name and address information which needs to be kept current. We get data feeds from a number of different government sources, and need to figure out matches, and whether or not to update the database with the data, or if a new