Merging two Data Frames using Fuzzy/Approximate String Matching in R

Submitted by 折月煮酒 on 2019-12-14 03:17:01

Question


DESCRIPTION

I have two datasets with information that I need to merge. The only common fields are strings that do not match perfectly and a numerical field that can differ substantially.

The only way to explain the problem is to show you the data. Here are a.csv and b.csv. I am trying to merge B into A.

There are three fields in B and four in A: Company Name (File A only), Fund Name, Asset Class, and Assets. So far, my focus has been on matching the Fund Names by replacing words or parts of the strings to create exact matches and then using:

a <- read.table(file = "http://bertelsen.ca/R/a.csv", header = TRUE, sep = ",", na.strings = FALSE, strip.white = TRUE, blank.lines.skip = FALSE, stringsAsFactors = TRUE)
b <- read.table(file = "http://bertelsen.ca/R/b.csv", header = TRUE, sep = ",", na.strings = FALSE, strip.white = TRUE, blank.lines.skip = FALSE, stringsAsFactors = TRUE)
merge(a, b, by = "Fund.Name")

However, this only gets me to about 30% matching. The rest I have to do by hand.

Assets is a numerical field that is not always correct in either file and can vary wildly if the fund has low assets. Asset Class is a string field that is "generally" the same in both files, although there are discrepancies.

Adding to the complication are the different series of funds in File B. For example:

AGF Canadian Value

AGF Canadian Value-D

In these cases, I have to choose the one that has no series suffix, or the one labelled "A", "-A", or "Advisor", as the match.

QUESTION

What would you say is the best approach? This exercise is something that I have to do on a monthly basis, and matching the funds manually is incredibly time-consuming. Examples of code would be instrumental.

IDEAS

One method that I think may work is normalizing the strings based on the first capitalized letter of each word in the string. But I haven't been able to figure out how to pull that off using R.

Another method I considered was creating an index of matches based on a combination of assets, fund name, asset class and company. But again, I'm not sure how to do this with R. Or, for that matter, if it's even possible.

Examples of code, comments, thoughts and direction are greatly appreciated!


Answer 1:


Approximate string matching is not a good idea, since an incorrect match would invalidate the whole analysis. If the names from each source are the same each time, then building indexes seems the best option to me too. This is easily done in R:

Suppose you have the data:

a<-data.frame(name=c('Ace','Bayes'),price=c(10,13))
b<-data.frame(name=c('Ace Co.','Bayes Inc.'),qty=c(9,99))

Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.

a.idx<-data.frame(name=c('Ace','Bayes'),idx=c(1,2))
b.idx<-data.frame(name=c('Ace Co.','Bayes Inc.'), idx=c(1,2))
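
As a rough sketch of the "pmatch as a starting point" step, the candidate pairs for the index could be generated like this and then checked by hand. The `candidates` data frame is a name introduced here for illustration only, and pmatch only finds a pair when the name in a is a unique prefix of a name in b.

```r
a <- data.frame(name = c("Ace", "Bayes"), price = c(10, 13),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc."), qty = c(9, 99),
                stringsAsFactors = FALSE)

# For each name in a, pmatch() returns the position of the single name in b
# that starts with it, or NA when there is no match or the match is ambiguous.
hit <- pmatch(a$name, b$name)

# Candidate pairs to review manually before fixing them in a.idx / b.idx.
candidates <- data.frame(a.name = a$name,
                         b.name = b$name[hit],
                         stringsAsFactors = FALSE)
candidates
```

Rows where b.name is NA are the ones that still need to be paired manually.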

Then for each run merge using:

a.rich<-merge(a,a.idx,by="name")
b.rich<-merge(b,b.idx,by="name")
merge(a.rich,b.rich,by="idx")

Which would give us:

  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99



Answer 2:


I highly recommend the fuzzyjoin package (dgrtwo/fuzzyjoin on GitHub):

stringdist_inner_join(a, b, by = "Fund.Name")
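
A minimal sketch of how that call might look on toy data, assuming the fuzzyjoin package is installed. Apart from "AGF Canadian Value" and "AGF Canadian Value-D", the fund names, asset values and asset classes are made up for illustration, and the `max_dist = 2` threshold is an arbitrary choice that would need tuning on the real files.

```r
library(fuzzyjoin)

# Toy data; only the two "AGF Canadian Value" names come from the question.
a <- data.frame(Fund.Name = c("AGF Canadian Value", "AGF Global Equity"),
                Assets = c(100, 250),
                stringsAsFactors = FALSE)
b <- data.frame(Fund.Name = c("AGF Canadian Value-D", "AGF Globl Equity"),
                Asset.Class = c("Canadian Equity", "Global Equity"),
                stringsAsFactors = FALSE)

# Keep pairs whose optimal string alignment distance is at most 2 and
# record that distance in a column named "dist".
matched <- stringdist_inner_join(a, b, by = "Fund.Name",
                                 method = "osa", max_dist = 2,
                                 distance_col = "dist")

# Closest pairs first, so doubtful matches end up at the bottom for review.
matched[order(matched$dist), ]
```

On real data one name in a can match several names in b (for example a fund and its "-D" series), so it is probably worth sorting by the distance column and keeping only the best candidate per fund before checking the result by hand.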




Answer 3:


One quick suggestion: try to do some matching on the different fields separately before using merge. The simplest approach is with the pmatch function, although R has no shortage of text matching functions (e.g. agrep). Here's a simple example:

pmatch(c("med", "mod"), c("mean", "median", "mode"))

For your dataset, this matches all the fund names out of a:

> nrow(merge(a,b,x.by="Fund.Name", y.by="Fund.name"))
[1] 58
> length(which(!is.na(pmatch(a$Fund.Name, b$Fund.name))))
[1] 238

Once you create matches, you can easily merge them together using those instead.
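
One way this last step could look, as a sketch on made-up data (the fund names, assets and asset classes below are illustrative, not taken from a.csv or b.csv):

```r
a <- data.frame(Fund.Name = c("AGF Canadian Value", "AGF Global Equity"),
                Assets = c(100, 250),
                stringsAsFactors = FALSE)
b <- data.frame(Fund.name = c("AGF Global Equity Fund",
                              "AGF Canadian Value Class"),
                Asset.Class = c("Global Equity", "Canadian Equity"),
                stringsAsFactors = FALSE)

# Position in b of the unique name that starts with each name in a (or NA).
idx <- pmatch(a$Fund.Name, b$Fund.name)
found <- !is.na(idx)

# Put the matched rows of a and b side by side.
merged <- cbind(a[found, , drop = FALSE], b[idx[found], , drop = FALSE])
rownames(merged) <- NULL
merged
```

The rows of a where `found` is FALSE are the ones left over for a fuzzier method or for manual matching.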




Answer 4:


I'm based in Canada as well and recognize the fund names.

This is a difficult one, as each of the data providers picks its own form for the individual fund names. Some follow a consistent structure, such as ending every name in either "Fund" or "Class"; others are all over the place. Each also seems to choose its own short forms, and these change regularly.

That's why so many people like you are doing this by hand on a regular basis. Some of the consulting firms do list indexes that link the various sources. Not sure if you've explored that route?

As Shane and Marek pointed out, this is a matching task more than a straight join. Many companies are struggling with this one. I'm in the middle of my own work on this...

Jay



Source: https://stackoverflow.com/questions/38620373/merge-two-datasets-in-r-using-a-common-text-in-the-fields
