Use rle to group by runs when using dplyr

庸人自扰 2020-11-29 09:21

In R, I want to summarize my data after grouping it by the runs of a variable x (i.e., each group of the data corresponds to a subset of the data where cons

2 Answers
  • 2020-11-29 09:53

    One option seems to be to wrap the rle computation in {} inside group_by, as in:

    dat %>%
        group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
        summarize(mean(y))
    #Source: local data frame [4 x 2]
    #
    #     yy mean(y)
    #  (int)   (dbl)
    #1     1     2.0
    #2     2     4.5
    #3     3     6.0
    #4     4     7.0
    

    It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
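
    In the meantime, a small base-R helper can play the role of rleid. The following is only a sketch under stated assumptions: run_id is a hypothetical name, and dat is made-up data chosen so that the group means match the output above (the question's actual data is not shown here).

    ```r
    # Hypothetical helper: returns 1, 2, 3, ... for successive runs of equal values.
    run_id <- function(v) {
      r <- rle(v)
      rep(seq_along(r$lengths), r$lengths)
    }

    # Made-up data (an assumption): the runs of x have lengths 3, 2, 1, 1.
    dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

    # One run id per row, then the mean of y within each run (base R only).
    ids <- run_id(dat$x)
    means <- tapply(dat$y, ids, mean)
    ```

    With dplyr loaded, the same helper should slot into the pipeline as dat %>% group_by(g = run_id(x)) %>% summarize(mean(y)). Later dplyr releases (1.1.0 onward) also provide consecutive_id() for this purpose.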


    I noticed that this problem occurs with a data.frame or tbl_df input, but not with a tbl_dt or data.table input:

    dat %>% 
        tbl_df %>% 
        group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
        summarize(mean(y))
    Error: cannot coerce type 'closure' to vector of type 'integer'
    
    dat %>% 
        tbl_dt %>% 
        group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
        summarize(mean(y))
    Source: local data table [4 x 2]
    
         yy mean(y)
      (int)   (dbl)
    1     1     2.0
    2     2     4.5
    3     3     6.0
    4     4     7.0
    

    I reported this as an issue on dplyr's GitHub page.

  • 2020-11-29 10:06

    If you explicitly create a grouping variable g it more or less works:

    dat %>%
        transform(g = with(rle(dat$x), rep(seq_along(lengths), lengths))) %>%
        group_by(g) %>%
        summarize(mean(y))
    Source: local data frame [4 x 2]
    
          g mean(y)
      (int)   (dbl)
    1     1     2.0
    2     2     4.5
    3     3     6.0
    4     4     7.0
    

    I used transform here because mutate throws an error.
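
    Another way to build the grouping variable explicitly, without rle at all, is to flag the rows where x changes and take a cumulative sum. This is a sketch rather than a tested drop-in: dat is made-up data chosen to reproduce the group means shown above, and the code assumes x has at least one row and no NA values.

    ```r
    # Made-up data (an assumption): the runs of x have lengths 3, 2, 1, 1.
    dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

    # A new run starts at row 1 and wherever x differs from the previous row.
    starts <- c(TRUE, dat$x[-1] != dat$x[-nrow(dat)])

    # The cumulative sum of the start flags numbers the runs 1, 2, 3, ...
    dat$g <- cumsum(starts)

    # Mean of y per run, using base R so the sketch needs no packages.
    res <- aggregate(y ~ g, data = dat, FUN = mean)
    ```

    Once g is an ordinary column, dat %>% group_by(g) %>% summarize(mean(y)) should work on it in the same way as in the transform version.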
