I have a data frame on which I calculate a run length encoding for a specific column, dir, whose values are either -1, 0, or 1:

dir.rle <- rle(df$dir)

How can I use this to compute a cumulative sum of another column within each of those runs?
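For reference, a minimal sketch of what rle() returns on a toy dir vector (the values are made up purely for illustration):

#rle() reports the length and value of each consecutive run
rle(c(1, 1, -1, 0, 0, 1))
#Run Length Encoding
#  lengths: int [1:4] 2 1 2 1
#  values : num [1:4] 1 -1 0 1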
This can be broken down into a two-step problem. First, if we create an indexing column based on the rle, then we can use that column to group by and run the cumsum. The grouping can then be performed by any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.
library(data.table)
library(plyr)
#data.table is the same thing as a data.frame for most purposes
#Fake data
dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
dir.rle <- rle(dat$dir)
#Compute an indexing column to group by
dat <- transform(dat, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
#What does the indexer column look like?
> head(dat)
     dir      value indexer
[1,]   1  0.5045807       1
[2,]   0  0.2660617       2
[3,]   1  1.0369641       3
[4,]   1 -0.4514342       3
[5,]  -1 -0.3968631       4
[6,]  -1 -2.1517093       4
#data.table approach
dat[, cumsum(value), by = indexer]
#plyr approach
ddply(dat, "indexer", summarize, V1 = cumsum(value))
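If you'd rather keep the cumulative sum alongside the original columns than return it as a separate result, a minimal data.table variant (reusing the dat and indexer from above; run.csum is just an illustrative column name) is to assign by reference with the := operator:

#Add the per-run cumulative sum as a new column of dat
dat[, run.csum := cumsum(value), by = indexer]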
Add a 'group' column to the data frame. Something like:
df <- data.frame(z = rnorm(100))  # dummy data
df$dir <- sign(df$z)              # dummy +/- 1
rl <- rle(df$dir)
df$group <- rep(1:length(rl$lengths), times = rl$lengths)
then use tapply to sum within groups:
tapply(df$z, df$group, sum)
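Note that tapply() with sum gives a single total per run; if, as in the question, you want a running cumulative sum within each run, base R's ave() applies a function group-wise and returns a vector as long as the original (a sketch reusing the dummy df above; z.csum is just an illustrative column name):

# Cumulative sum within each run, aligned row-for-row with df
df$z.csum <- ave(df$z, df$group, FUN = cumsum)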
Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).
I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:
df$group <- c(0, cumsum(!(diff(df$dir)==0)))
# Or, equivalently
df$group <- c(0, cumsum(as.logical(diff(df$dir))))
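As a quick sanity check of the diff()-based grouping (the dir vector below is made up purely for illustration), each change in dir starts a new group, numbered here from 0:

dir <- c(1, 1, -1, 0, 0, 1)
c(0, cumsum(as.logical(diff(dir))))
# [1] 0 0 1 2 2 3

The resulting vector can then serve as the grouping column for any of the aggregation approaches shown above.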