Cumulative sums over run lengths. Can this loop be vectorized?

伪装坚强ぢ 2020-12-18 10:46

I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.

dir.r

3 Answers
  • 2020-12-18 11:03

    This can be broken down into a two-step problem. First, create an indexing column from the rle output that assigns one id to each run. Second, group by that column and run cumsum within each group. The grouping can be done with any number of aggregation tools; I'll show two options, one using data.table and the other using plyr.

    library(data.table)
    library(plyr)
    # A data.table behaves like a data.frame for most purposes
    # Fake data
    dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
    dir.rle <- rle(dat$dir)
    # Compute an indexing column to group by: one id per run, repeated run-length times
    dat <- transform(dat, indexer = rep(seq_along(dir.rle$lengths), dir.rle$lengths))
    
    
    #What does the indexer column look like?
    > head(dat)
         dir      value indexer
    [1,]   1  0.5045807       1
    [2,]   0  0.2660617       2
    [3,]   1  1.0369641       3
    [4,]   1 -0.4514342       3
    [5,]  -1 -0.3968631       4
    [6,]  -1 -2.1517093       4
    
    
    #data.table approach
    dat[, cumsum(value), by = indexer]
    
    #plyr approach
    ddply(dat, "indexer", summarize, V1 = cumsum(value))
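
    As a small deterministic check of the two steps above, here is a minimal base-R sketch. The vectors dir and value are made-up illustration data, not taken from the question, and ave() is swapped in for the data.table/plyr grouping step so the result stays aligned with the original rows.

    ```r
    # Made-up illustration data (an assumption, not from the question)
    dir   <- c(1, 1, 0, -1, -1, -1, 1)
    value <- c(2, 3, 5, 1, 1, 1, 4)

    # Step 1: one id per run of identical dir values
    dir.rle <- rle(dir)
    indexer <- rep(seq_along(dir.rle$lengths), dir.rle$lengths)

    # Step 2: cumulative sum that restarts at the beginning of each run
    run.cumsum <- ave(value, indexer, FUN = cumsum)

    print(data.frame(dir, value, indexer, run.cumsum))
    ```

    Here the runs have lengths 2, 1, 3 and 1, so indexer is 1 1 2 3 3 3 4 and the running sum resets at rows 3, 4 and 7.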
    
  • 2020-12-18 11:06

    Add a 'group' column to the data frame. Something like:

    df <- data.frame(z = rnorm(100))  # dummy data
    df$dir <- sign(df$z)              # dummy +/- 1
    rl <- rle(df$dir)
    df$group <- rep(seq_along(rl$lengths), times = rl$lengths)
    

    then use tapply to sum within groups:

    tapply(df$z,df$group,sum)
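
    Note that sum gives one total per run. If a running total within each run is wanted instead, one possible variation (a sketch on made-up z values, not the data from the question) is to pass cumsum to the same tapply call. The result is then a list with one vector per group, which unlist() flattens back into row order because the groups are contiguous and numbered in increasing order.

    ```r
    # Made-up illustration data (an assumption, not from the question)
    z     <- c(0.5, 1.5, -2, -1, 3)
    dir   <- sign(z)
    rl    <- rle(dir)
    group <- rep(seq_along(rl$lengths), times = rl$lengths)

    # One total per run
    totals <- tapply(z, group, sum)

    # Running total within each run, flattened back to one value per row
    running <- unlist(tapply(z, group, cumsum), use.names = FALSE)

    print(totals)
    print(running)
    ```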
    
  • 2020-12-18 11:14

    Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).

    I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:

    df$group <- c(0, cumsum(!(diff(df$dir)==0)))
    
    # Or, equivalently
    df$group <- c(0, cumsum(as.logical(diff(df$dir))))
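
    To see that the diff()-based grouping marks the same runs as the rle-based one, here is a short check on a made-up dir vector (an assumption for illustration). The only difference is that this version starts counting at 0 rather than 1.

    ```r
    # Made-up illustration data (an assumption, not from the question)
    dir <- c(1, 1, 0, -1, -1, -1, 1)

    # diff() is non-zero exactly where a new run starts; cumsum counts those starts
    group.diff <- c(0, cumsum(diff(dir) != 0))

    # rle-based grouping for comparison
    rl <- rle(dir)
    group.rle <- rep(seq_along(rl$lengths), times = rl$lengths)

    print(rbind(dir, group.diff, group.rle))
    ```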
    