How can I fuzzy match strings from two datasets?

Question

I've been working on a way to join two datasets based on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had unique IDs to match on! Assume that cleaning has already been applied and that there may still be typos and insertions.

So far agrep is the closest tool I've found that might work. agrep (a base R function) uses a generalized Levenshtein edit distance, which counts the deletions, insertions and substitutions needed to turn one string into another. It returns the strings that fall within a given maximum distance of the pattern.
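To make the distance concrete, here is a minimal sketch using base R's adist, which computes the same kind of edit distance. The vectors a_names and b_names are just illustrative subsets of the data below, and the max.distance value is an arbitrary choice.

```r
a_names <- c("Ace Co", "Bayes", "asd")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Full pairwise Levenshtein distances: rows follow a_names, columns follow b_names
lev <- adist(a_names, b_names)
dimnames(lev) <- list(a_names, b_names)
print(lev)

# The diagonal holds the distance of each name to its intended partner,
# e.g. "Ace Co" -> "Ace Co." needs a single insertion
print(diag(lev))

# agrep returns every candidate within the allowed distance, not only the closest one
print(agrep("Bayes", b_names, max.distance = 1, value = TRUE))
```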

However, I've been having trouble going from a single value to applying this command across an entire data frame. I've crudely used a for loop to repeat the agrep call, but there's got to be an easier way.

See the following code:

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))

for (i in 1:6){
    a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
    a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}

Answer 1:

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before, I've treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you'd be better off with optim.

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String distance (Jaro-Winkler here; use your favorite method)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable,
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
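For what it's worth, the same greedy idea can be sketched on a wide distance matrix. This is only a sketch under assumptions: it uses base R's adist (plain Levenshtein) rather than the Jaro-Winkler distance above, so the distance values differ, and greedy_pairs is a made-up helper name.

```r
a_names <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")

# Wide distance matrix: one row per name in a, one column per name in b
m <- adist(a_names, b_names)
dimnames(m) <- list(a_names, b_names)

# Hypothetical helper: repeatedly take the closest remaining pair,
# then drop its row and column so neither name can be used again
greedy_pairs <- function(m) {
  out <- list()
  while (nrow(m) > 0 && ncol(m) > 0) {
    idx <- which(m == min(m), arr.ind = TRUE)[1, ]  # first pair on ties
    out[[length(out) + 1]] <- data.frame(
      a = rownames(m)[idx[1]],
      b = colnames(m)[idx[2]],
      d = m[idx[1], idx[2]],
      stringsAsFactors = FALSE
    )
    m <- m[-idx[1], -idx[2], drop = FALSE]
  }
  do.call(rbind, out)
}

res <- greedy_pairs(m)
print(res)
```

Dropping rows and columns from the matrix replaces the 0/1/-1 bookkeeping vector, at the cost of copying the matrix on every step.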

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")
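If you'd rather stay in base R, the many-to-one case might be sketched as follows. The assumptions here are mine: it uses adist(..., partial = TRUE), the substring-style edit distance that agrep uses by default, instead of Jaro-Winkler, so the distances will not match the table above, and the names m, best and matched are only illustrative.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                price = c(10, 13, 2, 1, 15, 1), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                qty = c(9, 99, 10), stringsAsFactors = FALSE)

# Distance from each name in a to the best-matching substring of each name in b
m <- adist(a$name, b$name, partial = TRUE)

# For every row of a, the column index of the closest name in b (first one on ties)
best <- apply(m, 1, which.min)

matched <- data.frame(a,
                      b_name = b$name[best],
                      qty = b$qty[best],
                      dist = m[cbind(seq_len(nrow(a)), best)],
                      stringsAsFactors = FALSE)
print(matched)
```

Every row of a gets a match this way, however poor, so in practice you would probably also want a cutoff on dist.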

Answer 2:

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.

As suggested by C8H10N4O2, the stringdist method="jw" creates the best matches for your example.

As suggested by dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance.

a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty = c(9, 99, 10))

library(fuzzyjoin); library(dplyr);

stringdist_join(a, b, 
                by = "name",
                mode = "left",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(name.x) %>%
  top_n(1, -dist)

#> # A tibble: 6 x 5
#> # Groups:   name.x [6]
#>   name.x price     name.y   qty       dist
#>   <fctr> <dbl>     <fctr> <dbl>      <dbl>
#> 1 Ace Co    10    Ace Co.     9 0.04761905
#> 2  Bayes    13 Bayes Inc.    99 0.16666667
#> 3    asd     2       asdf    10 0.08333333
#> 4    Bcy     1 Bayes Inc.    99 0.37777778
#> 5   Baes    15 Bayes Inc.    99 0.20000000
#> 6   Bays     1 Bayes Inc.    99 0.20000000

Answer 3:

I am not sure if this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) and might help.

install.packages("ipred")
install.packages("evd")
install.packages("RSQLite")
install.packages("ff")
install.packages("ffbase")
install.packages("ada")
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

# When this was written, RecordLinkage was not on CRAN, so it had to be installed from source
# (e.g. from GitHub), together with the dependencies installed above
require(RecordLinkage)

compareJW <- function(string, vec, cutoff) {
  require(RecordLinkage)
  jarowinkler(string, vec) > cutoff
}

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)

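# Pick your own cutoff, of course. Note: jarowinkler() appears to compare element-wise and to
# recycle the shorter vector, so each a$name is compared with one b$name here, not with all of them.
test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8)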
data.frame(name = a$name, price = a$price, test = test)

> data.frame(name = a$name, price = a$price, test = test)
    name price  test
1 Ace Co    10  TRUE
2  Bayes    13  TRUE
3    asd     2  TRUE
4    Bcy     1 FALSE
5   Baes    15  TRUE
6   Bays     1 FALSE

Answer 4:

I agree with the answer above that uses stringdist for the distance matrix. Adding the signature function below, taken from "Merging Data Sets Based on Partially Matched Data Elements", should make the matching more accurate, because the Levenshtein (LV) distance depends on character position as well as on additions and deletions.

##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
##So for example, United Kingdom would become kingdomunited
##We might also remove stopwords such as 'the' and 'of'.
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}
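As a rough illustration of how such a signature might be used, here is a sketch with a vectorized variant. The name name_signature is hypothetical (it also avoids masking methods::signature), and the two input strings are just the example from the comments above.

```r
# Hypothetical vectorized variant: lower-case each string, split it on spaces,
# sort the words alphabetically and paste them back together without spaces
name_signature <- function(x) {
  vapply(strsplit(tolower(x), " ", fixed = TRUE),
         function(words) paste(sort(words), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

sigs <- name_signature(c("United Kingdom", "Kingdom United"))
print(sigs)

# Word order no longer matters, so the edit distance between the signatures is zero
print(adist(sigs[1], sigs[2]))
```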

Answer 5:

I use lapply for those circumstances:

yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance=0.01)

then to write it as a csv it's not so straightforward:

write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
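A sketch of one way to flatten the list before writing it out, assuming the a and b data frames from the question. The max.distance value, the "; " separator and the temporary file path are arbitrary choices of mine.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                stringsAsFactors = FALSE)

# One list element per name in a, holding every name in b that agrep accepts
matches <- lapply(a$name, agrep, b$name, max.distance = 0.1, value = TRUE)

# Collapse each element to a single string ("" when nothing matched)
out <- data.frame(name = a$name,
                  match = vapply(matches, paste, character(1), collapse = "; "),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
write.csv(out, csv_path, row.names = FALSE)
print(out)
```

Collapsing to one string per row keeps the CSV rectangular even when a name has zero or several matches.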

Answer 6:

Here is what I used to count how many times a company appears in a list even though the company names are inexact matches:

step.1 Install phonics Package

step.2 create a new column called "soundexcodes" in "mylistofcompanynames"

step.3 Use soundex function to return soundex codes of the company names in "soundexcodes"

step.4 Copy the company names AND corresponding soundex code into a new file (2 columns called "companynames" and "soundexcode") called "companysoundexcodestrainingfile"

step.5 Remove duplicates of soundexcodes in "companysoundexcodestrainingfile"

step.6 Go through the list of remaining company names and change the names to the way you want them to appear in your original company list

example: "Amazon Inc" (A525) can become "Amazon" (A525), and "Accenture Limited" (A253) can become "Accenture" (A253)

step.7 Perform a left_join (or a simple vlookup) between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes by "soundexcodes"

step.8 The result should have the original list with a new column called "co.y", which has the name of the company the way you left it in the training file.

step.9 Sort "co.y" and check whether most of the company names are matched correctly. If so, replace the old company names with the new ones given by the vlookup of the soundex code.
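The steps above can be sketched roughly as follows. To keep the example self-contained it does not use the phonics package: soundex_simple is a simplified, hypothetical stand-in for phonics::soundex, and the company names and the hand-edited canonical names are made up for illustration.

```r
# Simplified Soundex (illustration only, not the phonics implementation)
soundex_simple <- function(x) {
  old <- "BFPVCGJKQSXZDTLMNR"
  new <- paste0(strrep("1", 4), strrep("2", 8), "334556")
  vapply(x, function(s) {
    ch <- strsplit(gsub("[^A-Z]", "", toupper(s)), "")[[1]]
    if (length(ch) == 0) return(NA_character_)
    codes <- chartr(old, new, ch)            # consonants -> digits, others unchanged
    keep <- !(ch %in% c("H", "W"))           # H and W do not separate equal codes
    keep[1] <- TRUE
    codes <- codes[keep]
    codes <- codes[c(TRUE, codes[-1] != codes[-length(codes)])]  # collapse repeats
    digits <- grep("[0-9]", codes[-1], value = TRUE)             # drop vowels and Y
    substr(paste0(c(ch[1], digits, "000"), collapse = ""), 1, 4)
  }, character(1), USE.NAMES = FALSE)
}

# Steps 2-3: add the codes to the original list
companies <- data.frame(co = c("Amazon Inc", "Amazon", "Accenture Limited", "Accenture"),
                        stringsAsFactors = FALSE)
companies$soundexcodes <- soundex_simple(companies$co)

# Steps 4-6: one row per code, with names edited by hand to the preferred form
training <- companies[!duplicated(companies$soundexcodes), ]
training$co <- c("Amazon", "Accenture")

# Steps 7-9: join on the code; co.y is the cleaned name, which can then be counted
result <- merge(companies, training, by = "soundexcodes", suffixes = c(".x", ".y"))
print(table(result$co.y))
```

Soundex only looks at the first few consonant sounds, so unrelated names can share a code; the manual check in the last step matters.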

Source: https://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets

Tags

string-matching

fuzzy-search

fuzzy-comparison