Search multiple columns for string to set indicator variable

一整个雨季 2021-01-23 13:57

I am using R and RStudio for the first time to work with a very large dataset (15 million cases) with many columns of data. To facilitate analysis, I need to search a range of

3 Answers
  •  悲&欢浪女
    2021-01-23 14:42

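    Another idea using lapply (melt comes from the reshape2 package; everything else is base R):

    library(reshape2)  # provides melt()

    # collect every distinct code that appears in any column
    uniq_dxs <- as.character(unique(melt(df1, id.vars = NULL)$value))
    # one logical indicator column per code: TRUE if the code occurs anywhere in the row
    df1[, paste0("var", uniq_dxs)] <- lapply(uniq_dxs, function(x) rowSums(df1==x) > 0)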
    
    df1
    #  Dx1 Dx2 Dx3 var001 var231 var245 var234 var777 var456 var444
    #1 001 234 456   TRUE  FALSE  FALSE   TRUE  FALSE   TRUE  FALSE
    #2 231 001 444   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE   TRUE
    #3 245 777 001   TRUE  FALSE   TRUE  FALSE   TRUE  FALSE  FALSE
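
    For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample df1 is reconstructed from the printed output above (three character columns Dx1 to Dx3), which is an assumption about the original data, and unlist() is swapped in for melt() so that nothing outside base R is needed.

    ```r
    # Sample data reconstructed from the printed output (an assumption).
    df1 <- data.frame(
      Dx1 = c("001", "231", "245"),
      Dx2 = c("234", "001", "777"),
      Dx3 = c("456", "444", "001"),
      stringsAsFactors = FALSE
    )

    # Unique codes across all columns; unlist() stands in for melt() here.
    uniq_dxs <- unique(unlist(df1, use.names = FALSE))

    # Compute every indicator before assigning, so each comparison
    # only ever sees the original Dx columns.
    flags <- lapply(uniq_dxs, function(x) rowSums(df1 == x) > 0)
    df1[paste0("var", uniq_dxs)] <- flags

    df1
    ```

    Each element of flags is a logical vector with one entry per row, so the assignment adds one indicator column per distinct code.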
    

    I was curious, so I benchmarked the mtabulate approach against the lapply one on my machine. The assignment with <- is not included in the timing:

    microbenchmark::microbenchmark(mtab = mtabulate(as.data.frame(t(df1)))!=0,
                                   lapply = lapply(uniq_dxs, function(x) rowSums(df1==x) > 0))
    Unit: microseconds
       expr      min        lq      mean   median       uq      max neval
       mtab 1039.317 1088.9120 1182.3375 1109.334 1145.255 5931.031   100
     lapply  742.838  795.7155  823.7991  813.220  843.488 1034.211   100
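
    One thing to keep in mind with 15 million rows: df1 == x builds a full logical matrix for every code. If only a handful of codes are of interest, a possible alternative is to compare column by column and combine the results with Reduce. This is a sketch that was not part of the benchmark above; the helper name has_code and the sample data are hypothetical.

    ```r
    # Hypothetical helper: TRUE for rows where any column in `cols` equals `code`.
    has_code <- function(df, cols, code) {
      Reduce(`|`, lapply(df[cols], function(col) col == code))
    }

    # Sample data assumed to match the example output above.
    df1 <- data.frame(
      Dx1 = c("001", "231", "245"),
      Dx2 = c("234", "001", "777"),
      Dx3 = c("456", "444", "001"),
      stringsAsFactors = FALSE
    )

    dx_cols <- paste0("Dx", 1:3)
    df1$var001 <- has_code(df1, dx_cols, "001")
    df1$var234 <- has_code(df1, dx_cols, "234")
    ```

    Note that a missing value in one column yields NA for that row unless another column matches, so NA handling may need an extra step on real data.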
    
