R: count consecutive occurrences of values in a single column

Posted by 社会主义新天地 on 2019-11-26 12:21:24

Combine rle, which finds the runs of consecutive equal values, with sequence, which numbers the elements within each run:

> sequence(rle(as.character(dataset$input))$lengths)
 [1] 1 1 2 1 2 1 1 2 3 4 1 1
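
To see why this works, here is a minimal sketch that splits the one-liner into its two steps. The vector x is an assumption standing in for dataset$input, which the post does not show; it is the sample vector used in the answers further down and gives the same output.

```r
# Stand-in for dataset$input (assumed; the original data frame is not shown)
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# Step 1: run-length encode the vector
runs <- rle(x)
runs$lengths   # 1 2 2 1 4 1 1
runs$values    # "a" "b" "a" "c" "a" "b" "c"

# Step 2: for every run length n, emit 1..n and concatenate the pieces
counter <- sequence(runs$lengths)
counter        # 1 1 2 1 2 1 1 2 3 4 1 1
```

The as.character() call in the original is most likely there because the column is a factor, which rle does not accept directly.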

A more efficient and more straightforward version of the function written below is now available in the data.table package as rleid. With it, the whole task becomes:

library(data.table)
setDT(dataset)[, counter := seq_len(.N), by=rleid(input)]

See ?rleid for more on usage and examples. Thanks to @Henrik for the suggestion to update this post.
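
The grouping idea behind that call can be sketched in base R: give every run its own id, then count within each id. This only emulates the idea and is not how data.table implements rleid. The names run_id and counter are illustrative, and the sketch assumes the vector has no NA values.

```r
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# Run id: starts at 1 and increases by one each time the value changes
run_id <- cumsum(c(TRUE, x[-1L] != x[-length(x)]))
run_id    # 1 2 2 3 3 4 5 5 5 5 6 7

# Within each run id, number the elements 1..n (the role of seq_len(.N))
counter <- ave(seq_along(x), run_id, FUN = seq_along)
counter   # 1 1 2 1 2 1 1 2 3 4 1 1
```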


rle is definitely the most convenient way to do it (+1 @Ananda's), but on bigger data one can do better in terms of speed. You can use the duplist and vecseq functions from data.table. Both are internal (not exported), so they have to be accessed with ::: and may change or disappear between package versions:

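library(data.table)
arun <- function(y) {
    # start index of each run of consecutive equal values
    w = data.table:::duplist(list(y))
    # run lengths: gaps between starts, plus the length of the last run
    w = c(diff(w), length(y)-tail(w,1L)+1L)
    # for each run, emit the sequence 1..length and concatenate
    data.table:::vecseq(rep(1L, length(w)), w, length(y))
}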

x <- c("a","b","b","a","a","c","a","a","a","a","b","c")
arun(x)
# [1] 1 1 2 1 2 1 1 2 3 4 1 1
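
The same three steps (find run starts, turn them into run lengths, expand each length into 1..n) can be sketched without internal functions. This is a base-R illustration of the logic only, not the data.table implementation, and it makes no claim about speed. The name run_counter is hypothetical, and the input is assumed to be non-empty and free of NA values.

```r
run_counter <- function(y) {
    # positions where a new run begins (the first element always starts one)
    starts <- which(c(TRUE, y[-1L] != y[-length(y)]))
    # distance between consecutive starts; the sentinel closes the last run
    lens <- diff(c(starts, length(y) + 1L))
    # expand each run length n into 1..n
    sequence(lens)
}

x <- c("a","b","b","a","a","c","a","a","a","a","b","c")
run_counter(x)
# [1] 1 1 2 1 2 1 1 2 3 4 1 1
```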

Benchmarking on big data:

set.seed(1)
x <- sample(letters, 1e6, TRUE)
# rle solution
ananda <- function(y) {
    sequence(rle(y)$lengths)
}

library(microbenchmark)
microbenchmark(a1 <- arun(x), a2 <- ananda(x), times=100)
Unit: milliseconds
            expr       min        lq    median       uq       max neval
   a1 <- arun(x)  123.2827  132.6777  163.3844  185.439  563.5825   100
 a2 <- ananda(x) 1382.1752 1899.2517 2066.4185 2247.233 3764.0040   100

identical(a1, a2) # [1] TRUE

The runner package has a dedicated function for this, streak_run, which accepts a vector as input. In the benchmark below it is considerably faster than the rle approach.

library(microbenchmark); library(runner)

x      <- sample(letters, 1e6, TRUE)
ananda <- function(y) sequence(rle(y)$lengths)

microbenchmark(a2 <- ananda(x), run <- streak_run(x), times=10)

#Unit: milliseconds
#                expr     min      lq     mean  median       uq      max neval
#     a2 <- ananda(x) 580.744 718.117 1059.676 944.073 1399.649 1699.293    10
#run <- streak_run(x)  37.682  39.568   42.277  40.591   43.947   52.917    10

identical(a2, run)
#[1] TRUE