df %>% split(.$x)
becomes slow for a large number of unique values of x, whereas splitting the data frame manually into smaller subsets first and then splitting each of those is much faster.
More an explanation than an answer. Sub-setting a large data.frame is more costly than sub-setting a small data.frame.
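The timings below use the microbenchmark package; the original df isn't reproduced in this excerpt, so the following setup is only an assumed, illustrative stand-in (a large data frame with many distinct values of x):

> library(microbenchmark)
> set.seed(1)
> ## assumed example data, not the original df from the question
> df = data.frame(x = sample(1e4, 1e6, replace = TRUE), y = rnorm(1e6))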
> df100 = df[1:100,]
> idx = c(1, 10, 20)
> microbenchmark(df[idx,], df100[idx,], times=10)
Unit: microseconds
         expr     min      lq     mean  median      uq     max neval
    df[idx, ] 428.921 441.217 445.3281 442.893 448.022 475.364    10
 df100[idx, ]  32.082  32.307  35.2815  34.935  37.107  42.199    10
split() pays this cost for each group.
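A rough way to see that per-group cost (a sketch using the assumed df above; absolute timings will differ): the more groups there are, the more times the expensive data.frame subset is paid.

> f_few  = sample(10,  nrow(df), replace = TRUE)   # 10 groups
> f_many = sample(1e4, nrow(df), replace = TRUE)   # 10,000 groups
> system.time(split(df, f_few))
> system.time(split(df, f_many))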
The reason can be seen by running Rprof():
> Rprof(); for (i in 1:1000) df[idx,]; Rprof(NULL); summaryRprof()
$by.self
       self.time self.pct total.time total.pct
"attr"      1.26      100       1.26       100

$by.total
               total.time total.pct self.time self.pct
"attr"               1.26       100      1.26       100
"[.data.frame"       1.26       100      0.00         0
"["                  1.26       100      0.00         0

$sample.interval
[1] 0.02

$sampling.time
[1] 1.26
All of the time is being spent in a call to attr(). Stepping through the code using debug("[.data.frame") shows that the pain involves a call like
attr(df, "row.names")
This small example shows a trick that R uses to avoid representing row names that are not present: use c(NA, -5L) rather than 1:5.
> dput(data.frame(x=1:5))
structure(list(x = 1:5), .Names = "x", row.names = c(NA, -5L), class = "data.frame")
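One way to check for this compact representation without forcing an expansion is base R's .row_names_info(), which with its default type reports the row count with a negative sign when row names are stored in the compact "automatic" form; a small sketch:

> .row_names_info(data.frame(x = 1:5))   # a negative value indicates compact "automatic" row names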
Note that attr() returns a vector -- the row.names are created on the fly, and for a large data.frame a large number of row.names are created:
> attr(data.frame(x=1:5), "row.names")
[1] 1 2 3 4 5
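The cost of that on-the-fly expansion can be measured directly (again a sketch using the assumed df above; each call re-creates all nrow(df) row names):

> microbenchmark(attr(df, "row.names"), times = 10)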
So one might expect that even nonsensical row.names would speed the calculation:
> dfns = df; rownames(dfns) = rev(seq_len(nrow(dfns)))
> system.time(split(dfns, dfns$x))
   user  system elapsed
  4.048   0.000   4.048
> system.time(split(df, df$x))
   user  system elapsed
 87.772  16.312 104.100
Splitting a vector or matrix would also be fast.
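One way to exploit that observation (a sketch, not part of the original answer): split just the row indices, which is a plain integer vector, and subset the data.frame only for the groups you actually need.

> idx_by_group = split(seq_len(nrow(df)), df$x)   # fast: vector split, no data.frame method involved
> grp1 = df[idx_by_group[[1]], ]                  # materialize a single group on demand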