Remove/collapse consecutive duplicate values in sequence

一向 2020-11-27 19:08

I have the following dataframe:

a a a b c c d e a a b b b e e d d

The required result should be

a b c d e a b e d         


        
4 Answers
  • 2020-11-27 19:33

    Much as I like... err, love rle, here's a shootout:

    EDIT: I can't figure out exactly what's going on with dplyr, so I used dplyr::lead. I'm on OS X, R 3.1.2, and the latest dplyr from CRAN.

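    library(dplyr)           # provides lead()
    library(microbenchmark)

    xlet <- sample(letters, 1e5, replace = TRUE)
    rleit  <- function(x) rle(x)$values
    # keep each element that differs from its successor; "" is a sentinel that never occurs in letters
    lagit  <- function(x) x[x != lead(x, default = "")]
    # same comparison in base R; the trailing TRUE keeps the final element
    tailit <- function(x) x[c(head(x, -1) != tail(x, -1), TRUE)]

    microbenchmark(rleit(xlet), lagit(xlet), tailit(xlet), times = 20)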
    Unit: milliseconds
             expr      min       lq   median       uq      max neval
      rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
      lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
     tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20
    
  • 2020-11-27 19:38
    library(dplyr)
    x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
    # the default must be a character value that never occurs in x
    x[x != lag(x, default = "")]
    #[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
    

    EDIT: For a data.frame

    mydf <- data.frame(
      V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
             "a", "a", "b", "b", "e", "e", "d", "d"),
      V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
             4, 8, 10, 199, 2, 5, 4, 10),
      stringsAsFactors = FALSE)
    

    The dplyr solution is a one-liner:

    mydf %>% filter(V1!= lag(V1, default="1"))
    #  V1 V2
    #1  a  1
    #2  b  2
    #3  c  4
    #4  d  3
    #5  e  9
    #6  a  4
    #7  b 10
    #8  e  2
    #9  d  4
    

    P.S.

    lead(x, 1), suggested by @Carl Witthoft, compares each element with the next one rather than the previous one, so it keeps the last row of each run instead of the first:

    leadit<-function(x) x!=lead(x, default="what")
    rows <- leadit(mydf[ ,1])
    mydf[rows, ]
    
    #   V1  V2
    #3   a   3
    #4   b   2
    #6   c   1
    #7   d   3
    #8   e   9
    #10  a   8
    #12  b 199
    #14  e   5
    #16  d  10
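
    The difference between the two results comes down to which end of each run is kept. A minimal base R sketch of that idea, using a small made-up vector v (not from the question) and no dplyr, so the shifted comparisons are written out by hand:

    ```r
    # Hypothetical toy input with three runs: a a a, b, c c
    v <- c("a", "a", "a", "b", "c", "c")
    n <- length(v)

    # lag-style: TRUE where an element differs from its predecessor (first of each run)
    first_of_run <- c(TRUE, v[-1] != v[-n])
    # lead-style: TRUE where an element differs from its successor (last of each run)
    last_of_run <- c(v[-n] != v[-1], TRUE)

    which(first_of_run)   # positions of run starts
    which(last_of_run)    # positions of run ends
    v[first_of_run]
    v[last_of_run]
    ```

    For a plain vector both masks select the same values; the choice only matters once other columns ride along, as with V2 above.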
    
  • 2020-11-27 19:44

    One easy way is to use rle:

    Here's your sample data:

    x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
    # Read 17 items
    

    rle returns a list with two components: the run lengths ("lengths") and the value repeated in each run ("values").

    rle(x)$values
    # [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
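
    To make the two components concrete, here is a short sketch on the same sample data. The variable name r is just an illustrative choice; inverse.rle() is the base R counterpart that rebuilds the vector from the run-length encoding.

    ```r
    x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a",
           "a", "b", "b", "b", "e", "e", "d", "d")

    r <- rle(x)
    r$lengths   # how long each run is
    r$values    # the value repeated in each run

    # The encoding is lossless: expanding it gives back the original vector
    identical(inverse.rle(r), x)
    ```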
    

    Update: For a data.frame

    If you are working with a data.frame, try something like the following:

    ## Sample data
    mydf <- data.frame(
      V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
             "a", "a", "b", "b", "e", "e", "d", "d"),
      V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
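             4, 8, 10, 199, 2, 5, 4, 10),
      stringsAsFactors = FALSE  # rle() needs a character vector, not a factor (the default before R 4.0)
    )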
    
    ## Use rle, as before
    X <- rle(mydf$V1)
    ## Identify the rows you want to keep
    Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
    Y
    # [1]  1  4  5  7  8  9 11 13 15
    mydf[Y, ]
    #    V1 V2
    # 1   a  1
    # 4   b  2
    # 5   c  4
    # 7   d  3
    # 8   e  9
    # 9   a  4
    # 11  b 10
    # 13  e  2
    # 15  d  4
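
    The cumsum step turns run lengths into the row number where each run starts. As a cross-check, the same rows can be found by comparing each V1 value with the one above it. This is only a sketch: it assumes V1 has no NA values and the data frame has at least one row, and keep and first_rows are illustrative names.

    ```r
    mydf <- data.frame(
      V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
             "a", "a", "b", "b", "e", "e", "d", "d"),
      V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
             4, 8, 10, 199, 2, 5, 4, 10),
      stringsAsFactors = FALSE
    )
    n <- nrow(mydf)

    # Keep row 1, plus every row whose V1 differs from the row above it
    keep <- c(TRUE, mydf$V1[-1] != mydf$V1[-n])
    first_rows <- which(keep)

    # Same indices as the rle()/cumsum() approach
    X <- rle(mydf$V1)
    Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
    all(first_rows == Y)

    mydf[keep, ]
    ```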
    

    Update 2

    The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

    library(data.table)
    as.data.table(mydf)[, .SD[1], by = rleid(V1)]
    #    rleid V2
    # 1:     1  1
    # 2:     2  2
    # 3:     3  4
    # 4:     4  3
    # 5:     5  9
    # 6:     6  4
    # 7:     7 10
    # 8:     8  2
    # 9:     9  4
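
    Note that the grouped result above shows the run id in place of V1. If the original columns should be kept as they are, one possible variant (a sketch, not necessarily the fastest option) is to filter on the first occurrence of each run id instead of grouping. DT and res are illustrative names.

    ```r
    library(data.table)

    DT <- data.table(
      V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
             "a", "a", "b", "b", "e", "e", "d", "d"),
      V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
             4, 8, 10, 199, 2, 5, 4, 10)
    )

    # rleid() gives every run of equal V1 values its own id;
    # !duplicated() keeps only the first row carrying each id
    res <- DT[!duplicated(rleid(V1))]
    res
    ```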
    
  • 2020-11-27 19:48

    With base R, I like this kind of quirky trick:

    x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
    
    x[x!=c(x[-1], FALSE)]
    #[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
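
    The one-liner works because c(x[-1], FALSE) shifts x left by one position and pads the end. The padding is coerced to the string "FALSE", so the last element always survives unless x itself ends in "FALSE". Below is a hedged sketch of a variant without that edge case. collapse_runs is a hypothetical helper name, and the function assumes x contains no NA values.

    ```r
    collapse_runs <- function(x) {
      n <- length(x)
      if (n < 2L) return(x)   # nothing to collapse
      # Keep the first element, then every element that differs from its predecessor
      x[c(TRUE, x[-1] != x[-n])]
    }

    x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a",
           "a", "b", "b", "b", "e", "e", "d", "d")
    collapse_runs(x)

    # No padding value is involved, so a trailing "FALSE" is handled like any other value
    collapse_runs(c("TRUE", "FALSE", "FALSE"))
    collapse_runs(character(0))
    ```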
    