vectorize cumsum by factor in R

Submitted on 2019-12-04 08:59:49

With the modified example input/output you could use the following base R approach (among others):

transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
#   itemcode goodp cum.goodp cum.goodpX
#1        a1     0         0          0
#2        a1     1         1          1
#3        a1     1         2          2
#4        a1     0         0          0
#5        a1     1         1          1
#6        a2     1         1          1
#7        a2     1         2          2
#8        a3     0         0          0
#9        a4     0         0          0
#10       a4     1         1          1
#11       a5     1         1          1
#12       a6     1         1          1
#13       a6     1         2          2
#14       a6     0         0          0
#15       a6     1         1          1

Note: I added column cum.goodp to the input df and created a new column cum.goodpX so you can easily compare the two.
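For reference, the sample input can be reconstructed like this (an assumption based on the printed output above; the original df may have been built differently):

```r
df <- data.frame(
  itemcode = rep(c("a1", "a2", "a3", "a4", "a5", "a6"), c(5, 2, 1, 2, 1, 4)),
  goodp    = c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
)
# cum.goodp is the expected per-run cumulative sum, added so the
# transform() output above can be compared against it
df$cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)
```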

Of course you can also use package-based approaches, either what @MartinMorgan suggested or, for example, dplyr or data.table, to name just two options. Those may be considerably faster than base R approaches for large data sets.

Here's how it would be done in dplyr:

library(dplyr)
df %>% 
   group_by(itemcode, grp = cumsum(goodp == 0)) %>% 
   mutate(cum.goodpX = cumsum(goodp))

A data.table option was already provided in the comments to your question.
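For completeness, a data.table version might look like this (a sketch, not necessarily the exact code from the comments; the dt and grp names are mine):

```r
library(data.table)

# Sample data as assumed from the printed output above
dt <- data.table(
  itemcode = rep(c("a1", "a2", "a3", "a4", "a5", "a6"), c(5, 2, 1, 2, 1, 4)),
  goodp    = c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
)

# Restart the cumulative sum within each itemcode whenever goodp == 0;
# the cumsum(goodp == 0) expression in `by` is evaluated over the whole table
dt[, cum.goodpX := cumsum(goodp), by = .(itemcode, grp = cumsum(goodp == 0))]
```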

A base R approach is to calculate cumsum over the whole vector, and to capture the geometry of the sub-groups using run-length encoding. Figure out where each group starts, and create group ids

start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)

Summarize these as a run-length encoding, and calculate the cumulative sum over the whole vector

r <- rle(f)
x <- cumsum(goodp)

Then use the geometry to get the offset that each embedded sum needs to be corrected by

offset <- c(0, x[cumsum(r$lengths)])

and calculate the updated value

x - rep(offset[-length(offset)], r$lengths)
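Putting the intermediate steps together on the sample data (vectors reconstructed from the output above):

```r
goodp    <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
itemcode <- rep(c("a1", "a2", "a3", "a4", "a5", "a6"), c(5, 2, 1, 2, 1, 4))

start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)                   # group ids: 1 1 1 2 2 3 3 4 5 5 6 7 7 8 8
r <- rle(f)                          # group lengths: 3 2 2 1 2 1 2 2
x <- cumsum(goodp)                   # running total over the whole vector
offset <- c(0, x[cumsum(r$lengths)]) # total accumulated before each group
res <- x - rep(offset[-length(offset)], r$lengths)
res
# [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1
```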

Here's a function

cumsumByGroup <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    r <- rle(cumsum(start))
    x <- cumsum(x)
    offset <- c(0, x[cumsum(r$lengths)])
    x - rep(offset[-length(offset)], r$lengths)
}

Here's the result applied to the sample data

> cumsumByGroup(goodp, itemcode)
 [1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1

and its performance

> n <- 1 + rpois(1000000, 1)
> goodp <- sample(c(0, 1), sum(n), TRUE)
> itemcode <- rep(seq_along(n), n)
> system.time(cumsumByGroup(goodp, itemcode))
   user  system elapsed 
   0.55    0.00    0.55 

The dplyr solution takes about 70s.

@alexis_laz's solution is both elegant and about 2 times faster than mine

cumsumByGroup1 <- function(x, f) {
    start <- c(TRUE, f[-1] != f[-length(f)]) | !x
    cs <- cumsum(x)
    # at each group start, (cs - x) is the total accumulated before that
    # group; cummax carries that offset forward through the group
    cs - cummax((cs - x) * start)
}
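To see why the cummax trick works, trace it on a small vector (the first five values of goodp above, all within one itemcode, so start is TRUE at position 1 and wherever goodp is 0):

```r
x     <- c(0, 1, 1, 0, 1)
start <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
cs    <- cumsum(x)                  # 0 1 2 2 3
# (cs - x) at each group start is the total accumulated before that group;
# multiplying by start zeroes it elsewhere, and cummax carries it forward
(cs - x) * start                    # 0 0 0 2 0
cummax((cs - x) * start)            # 0 0 0 2 2
cs - cummax((cs - x) * start)       # 0 1 2 0 1
```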