Question
Here is a sample tibble:
library(tibble)
test <- tibble(a = c("dd1", "dd2", "dd3", "dd4", "dd5"),
               name = c("a", "b", "c", "d", "e"),
               b = c("dd3", "dd4", "dd1", "dd5", "dd2"))
I want to add a new column, b_name, by self-joining test:
dplyr::inner_join(test, test, by = c("a" = "b"))
My table is way too large (2.7M rows with 4 columns), and I get the following error:
Error: std::bad_alloc
Please advise on the right way (best practice) to do this.
My final goal is to get the following structure:
a    name  b    b_name
dd1  a     dd3  c
dd2  b     dd4  d
dd3  c     dd1  a
dd4  d     dd5  e
dd5  e     dd2  b
Answer 1:
Another option is fmatch from the fastmatch package:
library(fastmatch)
test$b_name <- with(test, name[fmatch(b, a)])
test$b_name
#[1] "c" "d" "a" "e" "b"
According to ?fmatch (Description):
fmatch is a faster version of the built-in match() function.
Answer 2:
For that number of rows, I think data.table is probably going to give you a lot more speed. So here's a data.table solution:
library(data.table)
setDT(test)
Approach #1: self-join:
test[test, on = c(a = "b")]
# test[test, on = .(a == b)] ## identical
Approach #2: using the merge method for data.table:
merge(test, test, by.x = "a", by.y = "b")
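If the goal is only to add b_name, an update join might be worth trying, since data.table's := adds the column by reference rather than building a second joined table. A minimal sketch on the sample data, assuming the keys in a are unique; the join direction (b in the table against a in the lookup) is my reading of the desired output in the question, not something the approaches above state:

```r
library(data.table)

# Sample data from the question
test <- data.table(
  a    = c("dd1", "dd2", "dd3", "dd4", "dd5"),
  name = c("a", "b", "c", "d", "e"),
  b    = c("dd3", "dd4", "dd1", "dd5", "dd2")
)

# Update join: match each row's `b` against the `a` column of the lookup
# (here the same table) and copy the lookup's `name` into a new column.
# `i.name` refers to `name` from the table passed inside the brackets.
test[test, on = c(b = "a"), b_name := i.name]

print(test)
```

The column order here is a, name, b, b_name, which matches the structure asked for in the question without any renaming afterwards.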
Answer 3:
Here is another simple solution using the match function from base and mutate from the dplyr package:
library(dplyr)
new_test <- test %>%
  mutate(b_name = name[match(b, a)])
However, be careful with very long tables, as match may not be the most efficient option.
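One thing that may be worth checking before using any of these approaches: match() returns only the first matching position, whereas a join returns one row per matching pair, so duplicated keys in a can multiply the number of result rows. Whether that contributed to the std::bad_alloc error is only a guess, since the question does not say whether a is unique. A rough base R sketch (the helper name lookup_name is hypothetical):

```r
# Hypothetical helper: look up a value by key, refusing ambiguous keys.
lookup_name <- function(keys, lookup_keys, lookup_values) {
  # With duplicated lookup keys a join would return several rows per key,
  # while match() would silently keep only the first match.
  if (anyDuplicated(lookup_keys) > 0) {
    stop("lookup keys are not unique")
  }
  # Keys without a match come back as NA
  lookup_values[match(keys, lookup_keys)]
}

# Sample data from the question, as a plain data frame
test <- data.frame(
  a    = c("dd1", "dd2", "dd3", "dd4", "dd5"),
  name = c("a", "b", "c", "d", "e"),
  b    = c("dd3", "dd4", "dd1", "dd5", "dd2"),
  stringsAsFactors = FALSE
)

test$b_name <- lookup_name(test$b, test$a, test$name)
print(test)
```

Stopping on duplicates is a design choice for this sketch; on real data it might be preferable to deduplicate first or to decide explicitly which match to keep.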
Source: https://stackoverflow.com/questions/58049297/self-joining-in-r