I am using R and RStudio for the first time to work with a very large dataset (15 million cases) with many columns of data. To facilitate analysis, I need to search a range of
Another idea, using lapply from base R (melt is the only non-base call):
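library(reshape2)  # assumption: melt() comes from reshape2
# Collect every distinct code that appears in any column
uniq_dxs <- as.character(unique(melt(df1, id.vars = NULL)$value))
# One logical column per code: TRUE where any column in that row equals the code
df1[, paste0("var", uniq_dxs)] <- lapply(uniq_dxs, function(x) rowSums(df1 == x) > 0)
df1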
#   Dx1 Dx2 Dx3 var001 var231 var245 var234 var777 var456 var444
# 1 001 234 456   TRUE  FALSE  FALSE   TRUE  FALSE   TRUE  FALSE
# 2 231 001 444   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE
# 3 245 777 001   TRUE  FALSE   TRUE  FALSE   TRUE  FALSE  FALSE
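For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the three-row df1 from the printed output (the column types are an assumption; character columns are used here) and swaps melt for base R's unlist to collect the unique codes, so no extra package is needed. The names codes, flags and res are only illustrative.

```r
# Example data reconstructed from the printed output above
# (assumption: the Dx columns are character, not factor)
df1 <- data.frame(
  Dx1 = c("001", "231", "245"),
  Dx2 = c("234", "001", "777"),
  Dx3 = c("456", "444", "001"),
  stringsAsFactors = FALSE
)

# All distinct codes, read column by column (unlist used in place of melt)
codes <- unique(unlist(df1, use.names = FALSE))

# For each code, flag the rows where any Dx column equals it
flags <- lapply(codes, function(x) rowSums(df1 == x) > 0)
names(flags) <- paste0("var", codes)

# Bind the logical columns onto the original data
res <- cbind(df1, as.data.frame(flags))
res
```

One thing to watch on real data: if the Dx columns contain NA, rowSums(df1 == x) returns NA for those rows unless na.rm = TRUE is passed.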
I was curious, so I benchmarked this on my machine, comparing the mtabulate approach (mtabulate is from qdapTools) with the lapply one. The assignment (<-) is not included in the timings:
microbenchmark::microbenchmark(mtab = mtabulate(as.data.frame(t(df1))) != 0,
                               lapply = lapply(uniq_dxs, function(x) rowSums(df1 == x) > 0))
Unit: microseconds
   expr      min        lq      mean   median       uq      max neval
   mtab 1039.317 1088.9120 1182.3375 1109.334 1145.255 5931.031   100
 lapply  742.838  795.7155  823.7991  813.220  843.488 1034.211   100
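These timings are for the three-row example. To get a rough feel for how the lapply version behaves on more rows, one option is to time it on simulated data with base R's system.time, as in the sketch below. The row count, the pool of 50 codes, and the names n, pool and big are all made up for illustration, and no timings are shown because they depend on the machine.

```r
set.seed(1)                      # reproducible simulated data
n    <- 1e5                      # hypothetical row count; raise as memory allows
pool <- sprintf("%03d", 1:50)    # hypothetical pool of 50 three-digit codes

# Simulated data with the same three-column layout as df1
big <- data.frame(
  Dx1 = sample(pool, n, replace = TRUE),
  Dx2 = sample(pool, n, replace = TRUE),
  Dx3 = sample(pool, n, replace = TRUE),
  stringsAsFactors = FALSE
)

codes <- unique(unlist(big, use.names = FALSE))

# Time only the flag computation, leaving out the assignment into the data frame
timing <- system.time(
  flags <- lapply(codes, function(x) rowSums(big == x) > 0)
)
timing[["elapsed"]]
```

Each call to big == x scans every Dx column, so the work grows with the number of rows times the number of distinct codes.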