How to merge two data frames based on a partial string match in R?

瘦欲@ submitted on 2019-11-28 13:13:55

This might work for you and it handles duplicates:

First some dummy data:

df1 <- data.frame(name = c("George", "Abraham", "Barack"), stringsAsFactors = FALSE)
df2 <- data.frame(president = c("Thanks, Obama (Barack)", "Lincoln, Abraham, George", "George Washington"), stringsAsFactors = FALSE)

Find each name (the "code") within the full president strings (the "description") using grep:

idx2 <- lapply(df1$name, grep, df2$president)  # lapply always returns a list, one element per name

A single name can match several strings, so here I repeat each original index once per match so that the results align:

idx1 <- rep(seq_along(idx2), lengths(idx2))

"Merge" the datasets with cbind, aligned on the new indices:

> cbind(df1[unlist(idx1), , drop = FALSE], df2[unlist(idx2), , drop = FALSE])
       name                president
1    George Lincoln, Abraham, George
1.1  George        George Washington
2   Abraham Lincoln, Abraham, George
3    Barack   Thanks, Obama (Barack)
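
The steps above can be collected into one small helper. This is only a sketch: the function name partial_merge and its arguments are made up for illustration and are not part of the original answer. It also assumes you want literal matching (fixed = TRUE) rather than regular expressions, so that names containing characters such as "(" or "." do not need escaping.

```r
# Hypothetical helper (name and arguments are illustrative only):
# merge x and y where x[[x_col]] occurs as a substring of y[[y_col]].
partial_merge <- function(x, y, x_col, y_col) {
  # For each value in x, the row numbers of y that contain it (literal match)
  hits <- lapply(x[[x_col]], grep, x = y[[y_col]], fixed = TRUE)
  # Repeat each row of x once per hit so both sides line up
  x_idx <- rep(seq_along(hits), lengths(hits))
  y_idx <- unlist(hits)
  out <- cbind(x[x_idx, , drop = FALSE], y[y_idx, , drop = FALSE])
  rownames(out) <- NULL
  out
}

df1 <- data.frame(name = c("George", "Abraham", "Barack"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(president = c("Thanks, Obama (Barack)",
                                "Lincoln, Abraham, George",
                                "George Washington"),
                  stringsAsFactors = FALSE)

merged <- partial_merge(df1, df2, "name", "president")
print(merged)
```

Rows of df1 with no match are dropped, as in an inner join; if you need to keep them you would have to add them back yourself.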

(Your question is a bit vague; it would be better with some sample/foobar data, so this answer is unfortunately vague too.)

Try this:

?grep                                        # Pattern Matching and Replacement
X <- data.frame(a = letters[1:10], stringsAsFactors = FALSE)
grep(pattern = "c", x = X$a)                 # returns the position of "c": 3
grepl(pattern = "c", x = X$a)                # returns a logical vector: F F T F F ...
X[grepl(pattern = "c", x = X$a), "a"] <- "C" # replaces "c" with "C"
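
For partial matching against longer strings, a few extra arguments of grep/grepl may be useful. The snippet below is a small sketch on made-up strings, not part of the original answer; it contrasts positions with logical masks, case-insensitive matching, literal matching, and sub() for replacing only the matched part.

```r
x <- c("Thanks, Obama (Barack)", "Lincoln, Abraham, George", "George Washington")

# grep returns positions, grepl a logical vector of the same length as x
pos  <- grep("George", x)
mask <- grepl("George", x)

# Case-insensitive matching
ci <- grepl("george", x, ignore.case = TRUE)

# fixed = TRUE matches the literal text "(Barack)", parentheses included,
# instead of treating "(" and ")" as a regex group
lit <- grepl("(Barack)", x, fixed = TRUE)

# sub() replaces only the matched part, not the whole element
y <- sub("George", "G.", x, fixed = TRUE)
```

The logical mask from grepl is what you would normally use to subset rows of a data frame, e.g. df[mask, ].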

PS:

  • depending on how big / dirty your lists of names are, I've often found it useful to (i) create a clean (short and unambiguous) dictionary of names, (ii) add a new column holding this clean name to each original list and (iii) perform the merge on these columns;
  • aside from base::merge, I like to use dplyr's join functions (mostly because I like their cheat sheet).
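
The dictionary idea in the first point could look roughly like this. It is a sketch under assumptions: the data, the dictionary vector and the find_key helper are all invented for illustration, and it assumes each messy string contains at most one dictionary entry (if several match, only the first is kept).

```r
df1 <- data.frame(name = c("George", "Abraham", "Barack"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(president = c("Thanks, Obama (Barack)",
                                "Abraham Lincoln",
                                "George Washington"),
                  stringsAsFactors = FALSE)

# (i) a clean, unambiguous dictionary of names (illustrative)
dictionary <- c("George", "Abraham", "Barack")

# Hypothetical helper: first dictionary entry found in string s, NA if none
find_key <- function(s, keys) {
  hit <- keys[vapply(keys, grepl, logical(1), x = s, fixed = TRUE)]
  if (length(hit) > 0) hit[1] else NA_character_
}

# (ii) add the clean name as a new column
df2$name <- vapply(df2$president, find_key, character(1),
                   keys = dictionary, USE.NAMES = FALSE)

# (iii) an ordinary merge on the clean column
merged <- merge(df1, df2, by = "name")
print(merged)
```

If dplyr is installed, dplyr::inner_join(df1, df2, by = "name") should return the same rows as the merge call, though not necessarily in the same order.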