Why is split inefficient on large data frames with many groups?

心在旅途 2021-01-12 07:09
df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then

3 Answers
  •  不知归路
    2021-01-12 07:46

    This is more an explanation than an answer. Subsetting a large data.frame is more costly than subsetting a small one, even when the same few rows are selected:

    > df100 = df[1:100,]
    > idx = c(1, 10, 20)
    > microbenchmark(df[idx,], df100[idx,], times=10)
    Unit: microseconds
             expr     min      lq     mean  median      uq     max neval
        df[idx, ] 428.921 441.217 445.3281 442.893 448.022 475.364    10
     df100[idx, ]  32.082  32.307  35.2815  34.935  37.107  42.199    10
    

    split() pays this cost for each group.
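
    The `df` used above is not defined in this excerpt. A minimal sketch for reproducing the comparison, assuming a hypothetical data frame with one million rows and a grouping column `x` (both sizes are assumptions), and using base `system.time()` in place of microbenchmark:

    ```r
    # Hypothetical stand-in for the question's `df`; sizes are assumptions.
    set.seed(1)
    n <- 1e6
    df <- data.frame(x = sample(1e4, n, replace = TRUE), y = rnorm(n))
    df100 <- df[1:100, ]
    idx <- c(1, 10, 20)

    # Repeat each subset many times so that any difference is measurable.
    t_large <- system.time(for (i in 1:1000) df[idx, ])[["elapsed"]]
    t_small <- system.time(for (i in 1:1000) df100[idx, ])[["elapsed"]]
    c(large = t_large, small = t_small)
    ```

    The size of the gap, if any, depends on the R version in use, so the timings quoted in this answer should be read as one observation rather than a guarantee.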

    The reason can be seen by running Rprof():

    > Rprof(); for (i in 1:1000) df[idx,]; Rprof(NULL); summaryRprof()
    $by.self
           self.time self.pct total.time total.pct
    "attr"      1.26      100       1.26       100
    
    $by.total
                   total.time total.pct self.time self.pct
    "attr"               1.26       100      1.26      100
    "[.data.frame"       1.26       100      0.00        0
    "["                  1.26       100      0.00        0
    
    $sample.interval
    [1] 0.02
    
    $sampling.time
    [1] 1.26
    

    All of the time is spent in a call to attr(). Stepping through the code using debug("[.data.frame") shows that the expensive step is a call like

    attr(df, "row.names")
    

    This small example shows a trick that R uses to avoid representing row names that are not present: use c(NA, -5L), rather than 1:5.

    > dput(data.frame(x=1:5))
    structure(list(x = 1:5), .Names = "x", row.names = c(NA, -5L), class = "data.frame")
    

    Note that attr() returns a full vector -- the row.names are created on the fly, so for a large data.frame a large number of row.names are created on every call:

    > attr(data.frame(x=1:5), "row.names")
    [1] 1 2 3 4 5
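
    The compact representation can also be inspected without expanding it, using base R's `.row_names_info()`. A small sketch (the object names `d_auto` and `d_named` are illustrative only):

    ```r
    d_auto <- data.frame(x = 1:5)

    # type = 0L returns the internal attribute as stored, without expanding it.
    .row_names_info(d_auto, 0L)   # compact form: c(NA, -5L)
    # The default (type = 1L) gives the row count, negated for automatic row names.
    .row_names_info(d_auto)

    # attr() expands the compact form into a full integer vector.
    length(attr(d_auto, "row.names"))

    # Explicit (here reversed) row names are stored as an ordinary vector.
    d_named <- d_auto
    rownames(d_named) <- rev(seq_len(nrow(d_named)))
    .row_names_info(d_named, 0L)
    ```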
    

    So one might expect that even nonsensical row.names, which are stored explicitly rather than regenerated on each access, would speed up the calculation:

    > dfns = df; rownames(dfns) = rev(seq_len(nrow(dfns)))
    > system.time(split(dfns, dfns$x))
       user  system elapsed 
      4.048   0.000   4.048 
    > system.time(split(df, df$x))
       user  system elapsed 
     87.772  16.312 104.100 
    

    Splitting a vector or matrix would also be fast.
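
    One way to take advantage of this, offered as a possible workaround rather than something measured above, is to split the row indices (a plain integer vector) by the grouping column and then work on columns as vectors. The data frame and column names below are hypothetical:

    ```r
    # Hypothetical example data; `x` is the grouping column.
    df <- data.frame(x = c("a", "b", "a", "c", "b"), y = c(1, 2, 3, 4, 5))

    # Splitting an integer vector avoids subsetting the data frame once per group.
    rows_by_group <- split(seq_len(nrow(df)), df$x)

    # Per-group computation on a column vector, indexed by the row numbers.
    sums <- vapply(rows_by_group, function(i) sum(df$y[i]), numeric(1))
    sums
    ```

    Whether this helps in practice depends on what is done with each group afterwards; if every group must still be materialised as a data frame with `df[i, ]`, the per-group subsetting cost comes back.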
