问题
I’m trying to do a fuzzy logic join in R between two datasets:
- first data set has the name of a location and a column called
config
- second data set has the name of a location and two additional attributes that need to be summarized before they are joined to the first data set.
I would like to use the name
column to join between the two data sets. However the name
column may have additional or leading characters in either data set or have one word contained inside of a larger word. So for example if we looked at these two data sets, I'd like the name OPAL to join to the OPALAS, and SAUSALITO Y to join to SAUSALITO.
Dataset1:
Name Config
ALTO D BB
CONTRA ST
EIGHT A DD
OPALAS BB
SAUSALITO Y AA
SOLANO J ST
Dataset2:
Name Age Rank
ALTO D 50 2
ALTO D 20 6
CONTRA 10 10
CONTRA 15 15
EIGHTH 18 21
OPAL 19 4
SAUSALITO 2 12
SOLANO 34 43
Code for Summarizing of Dataset2
Data2a <- summaryBy(Age ~ Name,FUN=c(mean), data=Data2,na.rm=TRUE)
Data2b <- summaryBy(Rank ~ Name,FUN=c(sum), data=Data2,na.rm=TRUE)
Data2 <- data.frame(Data2a$Name, Data2a$Age.mean, Data2b$Rank.sum)
Desired Outcome:
Name Config Age Rank
ALTO D BB 35 8
CONTRA ST 12.5 25
EIGHT A DD 18 21
OPALAS BB 19 4
SAUSALITO Y AA 12 5
SOLANO J ST 34 43
回答1:
I was able to join the two datasets, using the fuzzyjoin package:
library(fuzzyjoin)
stringdist_inner_join(Dataset1, Data2,
by ="Name", distance_col = NULL)
来源:https://stackoverflow.com/questions/44359860/joining-two-datasets-using-fuzzy-logic