Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

问题

I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
library(microbenchmark)

set.seed(0)
dummy_data <- dplyr::data_frame(
  id=floor(runif(10000, 1, 10000))
  , label=floor(runif(10000, 1, 4))
)

microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))

Without pipe:

min       lq     mean   median       uq      max neval
1.691441 1.739436 1.841157 1.812778 1.880713 2.495853   100

With pipe:

min       lq     mean   median       uq      max neval
6.753999 6.969573 7.167802 7.052744 7.195204 8.833322   100

Why is %>% so much slower in this situation? Is there a better way to write this?

EDIT:

I made the data frame smaller and incorporated Moody_Mudskipper's suggestions into the benchmarking.

microbenchmark(
  nopipe=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
  magrittr=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
  magrittr2=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
  fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label %.% unique(.) %.% list(.))
)

Unit: milliseconds
      expr       min        lq      mean    median        uq      max neval
    nopipe  59.91252  70.26554  78.10511  72.79398  79.29025 214.9245   100
  magrittr 469.09573 525.80084 568.28918 558.05634 590.48409 767.4647   100
 magrittr2  84.06716  95.20952 106.28494 100.32370 110.92373 241.1296   100
  fastpipe  93.57549 103.36926 109.94614 107.55218 111.90049 162.7763   100

回答1:

What might be a negligible effect in a real-world full application becomes non-negligible when writing one-liners that are time-dependent on the formerly "negligible". I suspect if you profile your tests then most of the time will be in the summarize clause, so lets microbenchmark something similar to that:

> set.seed(99);z=sample(10000,4,TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100

This is doing something a bit different to your code but illustrates the point. Pipes are slower.

Because pipes need to restructure R's calling into the same one that function evaluations are using, and then evaluate them. So it has to be slower. By how much depends on how speedy the functions are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.

Profiling expressions like this showed me most of the time is spent in the pipe functions:

                         total.time total.pct self.time self.pct
"microbenchmark"              16.84     98.71      1.22     7.15
"%>%"                         15.50     90.86      1.22     7.15
"eval"                         5.72     33.53      1.18     6.92
"split_chain"                  5.60     32.83      1.92    11.25
"lapply"                       5.00     29.31      0.62     3.63
"FUN"                          4.30     25.21      0.24     1.41
 ..... stuff .....

then somewhere down in about 15th place the real work gets done:

"as.list"                      1.40      8.13      0.66     3.83
"unique"                       1.38      8.01      0.88     5.11
"rev"                          1.26      7.32      0.90     5.23

Whereas if you just call the functions as Chambers intended, R gets straight down to it:

                         total.time total.pct self.time self.pct
"microbenchmark"               2.30     96.64      1.04    43.70
"unique"                       1.12     47.06      0.38    15.97
"unique.default"               0.74     31.09      0.64    26.89
"is.factor"                    0.10      4.20      0.10     4.20

Hence the oft-quoted recommendation that pipes are okay on the command line where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out in one call to glm with a few hundred data points, but that's another story....

回答2:

But here is something I have learnt today. I am using R 3.5.0.

Code with x = 100 (1e2)

library(microbenchmark)
library(dplyr)

set.seed(99)
x <- 1e2
z <- sample(x, x / 2, TRUE)
timings <- microbenchmark(
  dp = z %>% unique %>% list, 
  bs = list(unique(z)))

print(timings)

Unit: microseconds
 expr    min      lq      mean   median       uq     max neval
   dp 99.055 101.025 112.84144 102.7890 109.2165 312.359   100
   bs  6.590   7.653   9.94989   8.1625   8.9850  63.790   100

Although, if x = 1e6

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
   dp 27.77045 31.78353 35.09774 33.89216 38.26898  52.8760   100
   bs 27.85490 31.70471 36.55641 34.75976 39.12192 138.7977   100

回答3:

So, I finally got around to running the expressions in OP's question:

set.seed(0)
dummy_data <- dplyr::data_frame(
  id=floor(runif(100000, 1, 100000))
  , label=floor(runif(100000, 1, 4))
)

microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))

This took so long that I thought I'd run into a bug, and force-interrupted R.

Trying again, with the number of repetitions cut down, I got the following times:

microbenchmark(
    b=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
    d=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
    times=2)

#Unit: seconds
# expr      min       lq     mean   median       uq      max neval
#    b 2.091957 2.091957 2.162222 2.162222 2.232486 2.232486     2
#    d 7.380610 7.380610 7.459041 7.459041 7.537471 7.537471     2

The times are in seconds! So much for milliseconds or microseconds. No wonder it seemed like R had hung at first, with the default value of times=100.

But why is it taking so long? First, the way the dataset is constructed, the id column contains about 63000 values:

length(unique(dummy_data$id))
#[1] 63052

Second, the expression that is being summarised over in turn contains several pipes, and each set of grouped data is going to be relatively small.

This is essentially the worst-case scenario for a piped expression: it's being called very many times, and each time, it's operating over a very small set of inputs. This results in plenty of overhead, and not much computation for that overhead to be amortised over.

By contrast, if we just switch the variables that are being grouped and summarized over:

microbenchmark(
    b=dummy_data %>% group_by(label) %>% summarise(list(unique(id))),
    d=dummy_data %>% group_by(label) %>% summarise(id %>% unique %>% list),
    times=2)

#Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
#    b 12.00079 12.00079 12.04227 12.04227 12.08375 12.08375     2
#    d 10.16612 10.16612 12.68642 12.68642 15.20672 15.20672     2

Now everything looks much more equal.

回答4:

magrittr's pipe is coded around the concept of functional chain.

You can create one by starting with a dot : . %>% head() %>% dim(), it's a compact way of writing a function.

When using a standard pipe call such as iris %>% head() %>% dim(), the functional chain . %>% head() %>% dim() will still be computed first, causing an overhead.

The functional chain is a bit of a strange animal :

(. %>% head()) %>% dim
#> NULL

When you look at the call . %>% head() %>% dim() , it actually parses as `%>%`( `%>%`(., head()), dim()). Basically, sorting things out requires some manipulation that takes a bit of time.

Another thing that takes a bit of time is to handle the different cases of rhs such as in iris %>% head, iris %>% head(.), iris %>% {head(.)} etc, to insert a dot at the right place when relevant.

You can build a very fast pipe the following way :

`%.%` <- function (lhs, rhs) {
    rhs_call <- substitute(rhs)
    eval(rhs_call, envir = list(. = lhs), enclos = parent.frame())
}

It will be much faster than magrittr's pipe and will actually behave better with edge cases, but will require explicit dots and obviously won't support functional chains.

library(magrittr)
`%.%` <- function (lhs, rhs) {
  rhs_call <- substitute(rhs)
  eval(rhs_call, envir = list(. = lhs), enclos = parent.frame())
}
bench::mark(relative = T,
  "%>%" =
    1 %>% identity %>% identity() %>% (identity) %>% {identity(.)},
  "%.%" = 
    1 %.% identity(.) %.% identity(.) %.% identity(.) %.% identity(.)
)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 %>%         15.9   13.3       1        4.75     1   
#> 2 %.%          1      1        17.0      1        1.60

^{Created on 2019-10-05 by the reprex package (v0.3.0)}

Here it was clocked at being 13. times faster.

I included it in my experimental fastpipe package, named as %>>%.

Now, we can also leverage the power of functional chains directly with a simple change to your call :

dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list)

It will be much faster because the functional chain is only parsed once and then internally it just applies functions one after another in a loop, very close to your base solution. My fast pipe on the other hand still adds a small overhead due to the eval / substitute done for every loop instance and every pipe.

Here's a benchmark including those 2 new solutions :

microbenchmark::microbenchmark(
  nopipe=dummy_data %>% group_by(id) %>% summarise(label = list(unique(label))),
  magrittr=dummy_data %>% group_by(id) %>% summarise(label = label %>% unique %>% list),
  functional_chain=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
  fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label =label %.% unique(.) %.% list(.)),
  times = 10
)

#> Unit: milliseconds
#>              expr      min       lq     mean    median       uq      max neval cld
#>            nopipe  42.2388  42.9189  58.0272  56.34325  66.1304  80.5491    10  a 
#>          magrittr 512.5352 571.9309 625.5392 616.60310 670.3800 811.1078    10   b
#>  functional_chain  64.3320  78.1957 101.0012  99.73850 126.6302 148.7871    10  a 
#>          fastpipe  66.0634  87.0410 101.9038  98.16985 112.7027 172.1843    10  a

来源：https://stackoverflow.com/questions/35933272/why-is-using-dplyr-pipe-slower-than-an-equivalent-non-pipe-expression-for

标签

performance

dplyr

magrittr

cardinality