Self Joining in R

淺唱寂寞╮ posted on 2021-01-28 21:06:13

Question


Here is a sample tibble:

test <- tibble(a = c("dd1","dd2","dd3","dd4","dd5"), 
               name = c("a", "b", "c", "d", "e"), 
               b = c("dd3","dd4","dd1","dd5","dd2"))

I want to add a new column, b_name, by self-joining test, using:

dplyr::inner_join(test, test, by = c("a" = "b"))

My table is way too large (2.7M rows with 4 columns), and I get the following error:

Error: std::bad_alloc

Please advise how to do it right / best practice.

My final goal is to get the following structure:

   a     name  b     b_name
   dd1   a     dd3   c
   dd2   b     dd4   d
   dd3   c     dd1   a
   dd4   d     dd5   e
   dd5   e     dd2   b 

Answer 1:


Another option is fmatch from the fastmatch package:

library(fastmatch)
test$b_name <- with(test, name[fmatch(b, a)])
test$b_name
#[1] "c" "d" "a" "e" "b"

According to the description in ?fmatch:

fmatch is a faster version of the built-in match() function.
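
Since fmatch is described as a faster version of match(), the lookup can be written so that it works with either function. The following is a minimal sketch, not part of the original answer: the variable name lookup_fun is hypothetical, and the fallback to base match() when fastmatch is not installed is an assumption made only so the snippet runs anywhere.

```r
# Sample data from the question, as plain vectors
a    <- c("dd1", "dd2", "dd3", "dd4", "dd5")
name <- c("a", "b", "c", "d", "e")
b    <- c("dd3", "dd4", "dd1", "dd5", "dd2")

# Hypothetical helper: use fastmatch::fmatch if available, otherwise base match()
lookup_fun <- if (requireNamespace("fastmatch", quietly = TRUE)) {
  fastmatch::fmatch
} else {
  match
}

# Position of each b value within a, then pick the name at that position
idx    <- lookup_fun(b, a)
b_name <- name[idx]
print(b_name)
```

Both functions return integer positions, so the indexing step is the same either way.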




Answer 2:


For that number of rows, I think data.table is probably going to give you a lot more speed. So here's a data.table solution:

library(data.table)
setDT(test)

Approach #1: self-join:

test[test, on = c(a = "b")]
# test[test, on = .(a == b)] ## identical

Approach #2: using data.table's merge method:

merge(test, test, by.x = "a", by.y = "b")
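
Both approaches above return a new joined table rather than adding b_name to test. If the goal is only the extra column, an update join may be a lighter alternative. This is a sketch under assumptions, not the answerer's method: the helper table lookup and its column layout are hypothetical names introduced here for illustration.

```r
library(data.table)

# Sample data from the question
test <- data.table(a    = c("dd1", "dd2", "dd3", "dd4", "dd5"),
                   name = c("a", "b", "c", "d", "e"),
                   b    = c("dd3", "dd4", "dd1", "dd5", "dd2"))

# Hypothetical lookup table: one row per key with the value to attach
lookup <- test[, .(a, b_name = name)]

# Update join: match test$b against lookup$a and add b_name by reference,
# without materialising a second full copy of the table
test[lookup, b_name := i.b_name, on = c(b = "a")]
print(test)
```

Rows of test keep their original order, and any b value without a counterpart in a is left as NA.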



Answer 3:


Here is another simple solution, using the match function from base R and mutate from the dplyr package:

library(dplyr)

new_test <- test %>% 
  # inside mutate, columns can be referenced directly (no test$ prefix needed)
  mutate(b_name = name[match(b, a)])

However, be careful with very long tables, as match might not be the fastest implementation.
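
One more point worth checking on a large table: match() returns only the first hit for each value, so the result always has one row per input row, whereas a join multiplies rows when the key is duplicated. Whether duplicated keys contributed to the std::bad_alloc error in the question is not known. The following base R sketch simply assumes the key column a should be unique and verifies that before doing the lookup; n_unmatched is a hypothetical name.

```r
# Sample data from the question, as a base data.frame
test <- data.frame(a    = c("dd1", "dd2", "dd3", "dd4", "dd5"),
                   name = c("a", "b", "c", "d", "e"),
                   b    = c("dd3", "dd4", "dd1", "dd5", "dd2"),
                   stringsAsFactors = FALSE)

# Assumption: a is a unique key. With duplicates, match() would silently
# use the first occurrence only, so fail early instead.
stopifnot(anyDuplicated(test$a) == 0L)

# Look up each b in a and attach the corresponding name
idx <- match(test$b, test$a)
test$b_name <- test$name[idx]

# Values of b with no counterpart in a come back as NA
n_unmatched <- sum(is.na(idx))
print(test)
```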



Source: https://stackoverflow.com/questions/58049297/self-joining-in-r
