How do I do one fuzzy and one exact match in a dataframe?

Submitted by 风格不统一 on 2021-02-10 14:28:45

Question


I want to be able to fuzzy match one column and exact match another column.

Say df1 looks like this:

And df2 looks like this:

I want to fuzzy match the "Name" column but exact match the "Year" column, so "Ashley" and "Ashlee" would be a match when the years are equal. This is what I have so far:

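library(dplyr)
library(fuzzyjoin)
library(stringdist)

res <- fuzzy_left_join(
  df1,
  df2,
  by = c("Year", "Name"),
  # one match function per column in `by`: exact on Year, fuzzy on Name
  match_fun = list(
    `==`,
    function(x, y) stringdist(tolower(x), tolower(y), method = "lv") <= 3
  )
)
res %>%
  select(Year = Year.x, everything(), -Year.y)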

It appears to be over-matching, though. Not sure what's going on.


Answer 1:


It seems you are on the right track (hard to tell without your data or you showing us your result!)

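fuzzy_left_join returns every pair of rows whose names have a string distance <= 3, not only the closest pair, which may be the "over-matching" you describe. A threshold of 3 is generous for short names: any two names of three letters or fewer are always within distance 3 of each other.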
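To get a feel for how permissive a threshold of 3 is, here is a small sketch that prints the Levenshtein distance for a few name pairs. Only "Ashley" and "Ashlee" come from the question; the other names are made up for illustration.

```r
library(stringdist)

# Levenshtein distances for a few lowercase name pairs (illustrative names only)
d_close <- stringdist("ashley", "ashlee", method = "lv")  # one substitution
d_far   <- stringdist("ashley", "ashton", method = "lv")  # three substitutions
d_short <- stringdist("amy", "bob", method = "lv")        # every letter differs

print(c(d_close, d_far, d_short))
# every one of these pairs passes a `<= 3` threshold
print(c(d_close, d_far, d_short) <= 3)
```

If pairs like the last two should not count as matches, lowering the threshold (for example to 1 or 2) is one option to try.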

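To keep only the best match per row, you can use %>% group_by(Year,Name) %>% slice_min(dist). Note that the join above only filters on distance and does not create a dist column, so dist has to be computed first (for example with stringdist on the two name columns), and the grouping columns must use the names they have after the join (Year.x and Name.x if both data frames share the column names).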
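A minimal sketch of that approach follows. The sample data is hypothetical, since the original df1 and df2 were not shown, and the Score column is made up to stand in for whatever else df2 carries. It assumes both data frames name their columns Year and Name, so the join suffixes them with .x and .y.

```r
library(dplyr)
library(fuzzyjoin)
library(stringdist)

# Hypothetical sample data; the original df1 and df2 were not shown
df1 <- data.frame(Year = c(2000, 2001), Name = c("Ashley", "Maria"))
df2 <- data.frame(
  Year  = c(2000, 2000, 2001, 2002),
  Name  = c("Ashlee", "Ashton", "Marla", "Ashley"),
  Score = c(1, 2, 3, 4)
)

res <- fuzzy_left_join(
  df1, df2,
  by = c("Year", "Name"),
  match_fun = list(
    `==`,
    function(x, y) stringdist(tolower(x), tolower(y), method = "lv") <= 3
  )
)

best <- res %>%
  # the join only filters on distance, so compute it explicitly
  mutate(dist = stringdist(tolower(Name.x), tolower(Name.y), method = "lv")) %>%
  # one group per row of df1, then keep the closest candidate
  group_by(Year.x, Name.x) %>%
  slice_min(dist, n = 1) %>%
  ungroup() %>%
  select(Year = Year.x, everything(), -Year.y)

print(best)
```

Here "Ashley" in 2000 initially matches both "Ashlee" and "Ashton", and slice_min keeps only the closer one. slice_min keeps ties by default; passing with_ties = FALSE would force a single row per group.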



Source: https://stackoverflow.com/questions/58442426/how-do-i-do-one-fuzzy-and-one-exact-match-in-a-dataframe
