Cumulative aggregates within tidyverse

Question

Say I have a tibble (or data.table) that consists of two columns:

a <- tibble(id = rep(c("A", "B"), each = 6), val = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1))

Furthermore, I have a function called myfun that takes a numeric vector of arbitrary length as input and returns a single number. For example, you can think of myfun as the standard deviation.

Now I would like to add a third column to my tibble (called result) that contains the output of myfun applied cumulatively to val, grouped by id. For example, the first entry of result should contain myfun(val[1]), the second entry should contain myfun(val[1:2]), and so on. In other words, I would like to implement a cumulative version of myfun.

Of course, there are a lot of easy solutions outside the tidyverse using loops and whatnot. But I would be interested in a solution within the tidyverse or within the data.table framework.

Any help is appreciated.

Answer 1:

You could do it this way:

library(tidyverse)

a %>% 
  group_by(id) %>% 
  mutate(y = map_dbl(seq_along(val),~sd(val[1:.x]))) %>%
  ungroup()

# # A tibble: 12 x 3
#       id   val         y
#    <chr> <dbl>     <dbl>
#  1     A     1        NA
#  2     A     0 0.7071068
#  3     A     0 0.5773503
#  4     A     1 0.5773503
#  5     A     0 0.5477226
#  6     A     1 0.5477226
#  7     B     0        NA
#  8     B     0 0.0000000
#  9     B     0 0.0000000
# 10     B     1 0.5000000
# 11     B     1 0.5477226
# 12     B     1 0.5477226

Explanation

As is usual in tidyverse pipelines, we first group by id. We then use mutate rather than summarize, because we want to keep every original row instead of collapsing each group to a single row.

map_dbl is used here to loop over the vector of end indices. seq_along(val) is 1:6 for both groups in this example.

Functions in the map family accept the ~ (formula) shorthand for an anonymous function, in which the first argument is referred to as .x.
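To make the shorthand concrete, here is a small sketch comparing the formula form with an explicit anonymous function. It uses only group A's values from the question; the names with_formula and with_function are illustrative and not part of the original answer.

```r
library(purrr)

# Values of group "A" from the question's example data.
val <- c(1, 0, 0, 1, 0, 1)

# Formula shorthand: .x stands for the current end index.
with_formula <- map_dbl(seq_along(val), ~ sd(val[1:.x]))

# The same computation written as an explicit anonymous function.
with_function <- map_dbl(seq_along(val), function(i) sd(val[1:i]))

# Both forms should produce the same vector of cumulative standard deviations.
identical(with_formula, with_function)
```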

Looping over these indices, we first compute sd(val[1:1]), i.e. sd(val[1]), which is NA because the sample standard deviation of a single value is undefined; then sd(val[1:2]), and so on.

By design, map_dbl returns a vector of doubles, and this vector becomes the y column.
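The same idea can be wrapped in a small reusable helper so that any myfun can be plugged in. The following is a minimal base R sketch, not code from the original answer: the helper name cumulate is hypothetical, and sd stands in for myfun as the question suggests.

```r
# Hypothetical helper (an assumption for illustration): apply f to every
# prefix x[1], x[1:2], ..., x[1:n] and collect one number per prefix.
cumulate <- function(x, f) {
  vapply(seq_along(x), function(i) f(x[seq_len(i)]), numeric(1))
}

# Stand-in for the question's myfun.
myfun <- sd

# Values of group "A" from the question's example data.
val_a <- c(1, 0, 0, 1, 0, 1)
res_a <- cumulate(val_a, myfun)
res_a
# Expected to match the y column for group A shown above:
# NA 0.7071068 0.5773503 0.5773503 0.5477226 0.5477226
```

Inside a grouped pipeline, such a helper could be called as mutate(result = cumulate(val, myfun)), which would also give the column the name the question asks for.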

Answer 2:

You can use zoo::rollapplyr with a dynamic width, i.e. a vector of window widths with one entry per element. Within each group, 1:n() or seq(n()) produces such a vector, so the right-aligned window for row i covers the first i rows of the group.

Let's apply it with sd to the data provided by the OP:

library(dplyr)
library(zoo)

a %>% group_by(id) %>%
  mutate(y = rollapplyr(val, 1:n(), sd))

#   # Groups: id [2]
#   id      val      y
#   <chr> <dbl>  <dbl>
#  1 A      1.00 NA    
#  2 A      0     0.707
#  3 A      0     0.577
#  4 A      1.00  0.577
#  5 A      0     0.548
#  6 A      1.00  0.548
#  7 B      0    NA    
#  8 B      0     0    
#  9 B      0     0    
# 10 B      1.00  0.500
# 11 B      1.00  0.548
# 12 B      1.00  0.548
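Since the question also mentions data.table, here is a hedged sketch of how the same prefix loop might look there. It is not part of either original answer; the object name dt is illustrative, and sd again stands in for myfun.

```r
library(data.table)

# Stand-in for the question's myfun.
myfun <- sd

# The question's example data as a data.table (dt is an illustrative name).
dt <- data.table(
  id  = rep(c("A", "B"), each = 6),
  val = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1)
)

# Within each id, .N is the group size; for each end index i,
# apply myfun to the first i values of that group.
dt[, result := vapply(seq_len(.N),
                      function(i) myfun(val[seq_len(i)]),
                      numeric(1)),
   by = id]

dt
```

Like the answers above, this recomputes myfun on every prefix, so the cost grows quadratically with group size; that is fine for small groups but worth keeping in mind for long ones.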

Source: https://stackoverflow.com/questions/50599976/cumulative-aggregates-within-tidyverse

Tags: dataframe, dplyr, tidyverse, purrr