`split` is a generic. Whereas `split.default` is quite fast, `split.data.frame` gets terribly slow as the number of levels to split on increases. A faster alternative is to use `data.table`. I'll illustrate the difference on bigger data here:
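As a quick way to see what the generic dispatches to (a minimal check; the exact list depends on your R version and which packages are loaded):

```r
## List the S3 methods behind the generic; split.data.frame is the one
## that degrades as the number of levels grows.
methods(split)
# e.g. split.data.frame, split.Date, split.default, split.POSIXct, ...
```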
**Sample data** (what @Roland was referring to in his comment):
```r
require(data.table)
set.seed(45)
DF <- data.frame(ids = sample(1e4, 1e6, TRUE), x = sample(letters, 1e6, TRUE),
                 y = runif(1e6))
DT <- as.data.table(DF)
```
**Functions + benchmarking**
Note that the order of the data will differ here, as `split` sorts by "ids". If you want that order, you can first do `setkey(DT, ids)` and then run `f2` (see the sketch after the function definitions).
```r
f1 <- function() split(DF, DF$ids)
f2 <- function() {
    ans <- DT[, list(list(.SD)), by = ids]$V1
    setattr(ans, 'names', unique(DT$ids))  # sets names by reference, no copy here
}
```
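For completeness, a minimal sketch of that keyed variant (`f3` is a hypothetical name, not part of the benchmark below): `setkey` physically reorders `DT` by reference, so both the groups and `unique(DT$ids)` come out sorted, matching the order `split` produces.

```r
## Hypothetical f3: key DT by ids first (reorders by reference), so the
## grouped output and its names are in the same sorted order as split()'s.
f3 <- function() {
    setkey(DT, ids)
    ans <- DT[, list(list(.SD)), by = ids]$V1
    setattr(ans, 'names', unique(DT$ids))
    ans
}
```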
```r
require(microbenchmark)
microbenchmark(ans1 <- f1(), ans2 <- f2(), times = 10)
# Unit: milliseconds
#           expr        min         lq     median         uq       max neval
#   ans1 <- f1() 37015.9795 43994.6629 48132.3364 49086.0926 63829.592    10
#   ans2 <- f2()   332.6094   361.1902   409.2191   528.0674  1005.457    10
```
`split.data.frame` took a median of ~48 seconds, whereas `data.table` took ~0.41 seconds.
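To sanity-check that the two results hold the same groups, something along these lines should work (a sketch, with one caveat: `split` keeps the "ids" column and the original row names, while `.SD` drops the grouping column, so compare the values only):

```r
## Compare one group's values across the two results. ans1 elements are
## data.frames with an ids column and original row names; ans2 elements
## are data.tables without the grouping column.
g <- names(ans1)[1]
i <- which(as.character(names(ans2)) == g)   # robust if names were set as integers
all.equal(ans1[[g]][, c("x", "y")], as.data.frame(ans2[[i]]),
          check.attributes = FALSE)          # should be TRUE
```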