Improving performance of split() function in R?

Posted by 萝らか妹 on 2019-12-04 04:05:41

Question


I have a data frame in a very simple form:

    X Y
    ---
    A 1
    A 2
    B 3
    C 1
    C 3

My end result should be a list like this:

$`A`
[1] 1 2

$`B`
[1] 3

$`C`
[1] 1 3

For this operation I am using the split() function in R:

k <- split(Y, X)

This works fine. However, when I apply this code to a data frame with 22 million rows, including 10 million groups for X and 387,000 values for Y, it becomes very time-consuming. I tried the RRO 8.0 open version for MKL support, but still only one core is used. The machine has 64 GB of RAM, so memory shouldn't be an issue.
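To reproduce the slowdown at a smaller scale, a minimal base R sketch along these lines may help. The sizes are assumptions (roughly 1/100 of the figures above), and the synthetic vectors `x` and `y` are hypothetical stand-ins for the real columns:

```r
# Scaled-down synthetic data: ~220,000 rows spread over up to 100,000 groups.
# These sizes are illustrative assumptions, not the real data set.
set.seed(1)
n_rows   <- 220000L
n_groups <- 100000L

x <- sample(n_groups, n_rows, replace = TRUE)  # group ids (stand-in for X)
y <- sample(387000L, n_rows, replace = TRUE)   # values (stand-in for Y)

# Time the same call used in the question
timing <- system.time(k <- split(y, x))
print(timing)
```

Increasing `n_rows` and `n_groups` together should show how the run time grows with the number of groups.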

Any ideas for a smarter way to compute this?


Answer 1:


Try

 library(data.table)
 DT <- as.data.table(df)
 DT1 <- DT[, list(Y=list(Y)), by=X]
 DT1$Y
 #[[1]]
 #[1] 1 2

 #[[2]]
 #[1] 3

 #[[3]]
 #[1] 1 3

Or using dplyr

 library(dplyr)
 df1 <-  df %>% 
             group_by(X) %>%
              do(Y=c(.$Y))

 df1$Y
 #[[1]]
 #[1] 1 2

 #[[2]]
 #[1] 3

 #[[3]]
 #[1] 1 3

data

 df <- structure(list(X = c("A", "A", "B", "C", "C"), Y = c(1L, 2L, 
 3L, 1L, 3L)), .Names = c("X", "Y"), class = "data.frame", row.names = c(NA, 
 -5L))
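Both snippets above return an unnamed list, whereas the output requested in the question is a list named by group. One way to get there, sketched here assuming the same toy `df`, is to attach the group labels with `setNames()`. Using `keyby=` instead of `by=` is a choice made in this sketch so the groups come back sorted by `X`; plain `by=` keeps them in order of first appearance.

```r
library(data.table)

# Same toy data as in the answer
df <- data.frame(X = c("A", "A", "B", "C", "C"),
                 Y = c(1L, 2L, 3L, 1L, 3L),
                 stringsAsFactors = FALSE)

DT <- as.data.table(df)

# One list element per group; keyby= sorts the result by X
DT1 <- DT[, list(Y = list(Y)), keyby = X]

# Attach the group labels as list names
k <- setNames(DT1$Y, DT1$X)
str(k)
```

On this toy data the result should match `split(df$Y, df$X)`.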



Answer 2:


I found an elegant solution using similar code with dplyr and/or data.table. I searched for how to concatenate groups in R and found this post:

Efficiently concate character content within one column, by group in R

And actually, it works quite nicely with

library(data.table)
library(dplyr)
dt = data.table(content = sample(letters, 26e6, TRUE), groups = LETTERS)
df = as.data.frame(dt)

system.time(dt[, paste(content, collapse = " "), by = groups])
#   user  system elapsed 
#   5.37    0.06    5.65 

system.time(df %>% group_by(groups) %>% summarise(paste(content, collapse = " ")))
#   user  system elapsed 
#   7.10    0.13    7.67 
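Note that this approach produces one concatenated string per group rather than a list of vectors. A minimal sketch on the toy data from the question (the names `DT` and `DT2` are just illustrative) shows what the result looks like:

```r
library(data.table)

# Toy data from the question
DT <- data.table(X = c("A", "A", "B", "C", "C"),
                 Y = c(1L, 2L, 3L, 1L, 3L))

# Collapse the Y values of each group into a single space-separated string
DT2 <- DT[, .(Y = paste(Y, collapse = " ")), by = X]
print(DT2)
```

Whether a single string per group is an acceptable substitute for the `split()` output depends on what is done with the result afterwards.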

Thanks for all your help.



Source: https://stackoverflow.com/questions/27297843/improving-performance-of-split-function-in-r
