问题
I have two dataframes from different sources that refer to the same people, but due to errors from self-reported data, the dates may be slightly off.
Example data:
df1 <- data.frame(name= c("Ann", "Betsy", "Charlie", "Dave"), dob= c(as.Date("2000-01-01", "%Y-%m-%d"), as.Date("2001-01-01", "%Y-%m-%d"), as.Date("2002-01-01", "%Y-%m-%d"), as.Date("2003-01-01", "%Y-%m-%d")), stringsAsFactors=FALSE)
df2 <- data.frame(name= c("Ann", "Charlie", "Elmer", "Fred"), dob= c(as.Date("2000-01-11", "%Y-%m-%d"), as.Date("2004-01-01", "%Y-%m-%d"), as.Date("2001-01-01", "%Y-%m-%d"), as.Date("2006-01-01", "%Y-%m-%d")), stringsAsFactors=FALSE)
I want to match by exact name, with dplyr like:
library(dplyr)
inner_join(df1, df2, by = c("name"))
name dob.x dob.y
Ann 2000-01-01 2000-01-11
Charlie 2002-01-01 2004-01-01
but also by dates of birth (dob) within 30 days, with the fuzzyjoin package like:
library(fuzzyjoin)
difference_inner_join(df1, df2, by=c("dob"), max_dist = 30)
name.x dob.x name.y dob.y
Ann 2000-01-01 Ann 2000-01-11
Betsy 2001-01-01 Elmer 2001-01-01
But combine both criteria, so that only Ann would be returned
回答1:
Well you could do this:
difference_inner_join(df1, df2, by=c("dob"), max_dist = 30) %>%
filter(name.x == name.y)
name.x dob.x name.y dob.y
1 Ann 2000-01-01 Ann 2000-01-11
回答2:
Relying on dplyr and base R alone.
I rarely need fuzzy joins. inner_join
and subsequently filter
usually is enough
inner_join(df1, df2, by = c("name")) %>%
filter(abs(difftime(dob.x,dob.y, "days"))<30)
result
name dob.x dob.y
1 Ann 2000-01-01 2000-01-11
来源:https://stackoverflow.com/questions/55862854/match-some-columns-exactly-and-some-partially-with-inner-join