Use rle to group by runs when using dplyr

Anonymous (unverified), submitted 2019-12-03 01:23:02

Question:

In R, I want to summarize my data after grouping it by runs of a variable x (that is, each group is a stretch of consecutive rows that share the same x value). For instance, consider the following data frame, where I want to compute the average y value within each run of x:

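(dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
# 4 2 4
# 5 2 5
# 6 1 6
# 7 2 7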

In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.

It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:

tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
#   1   2   3   4
# 2.0 4.5 6.0 7.0
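To make the grouping vector in the tapply call easier to follow, here is a small sketch that breaks the rle expression into steps. The vector x is written out from the run lengths and values described above; the names r and run_id are only illustrative.

```r
# x as described above: runs of length 3, 2, 1, 1 with values 1, 2, 1, 2
x <- c(1, 1, 1, 2, 2, 1, 2)

# rle() returns the length and the value of each run
r <- rle(x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2

# Repeat each run's index once per element of that run to get a run id per row
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id     # 1 1 1 2 2 3 4
```

The resulting run_id has one entry per row, which is what tapply (or any grouping function) needs as its index.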

I figured I would be able to pretty directly carry over this logic to dplyr, but my attempts so far have all ended in errors:

library(dplyr)

# First attempt
dat %>%
  group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

# Attempt 2 -- maybe "with" is the problem?
dat %>%
  group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
  summarize(mean(y))
# Error: invalid subscript type 'closure'

For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:

dat %>%
  group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
  summarize(mean(y))
#     run mean(y)
#   (dbl)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0
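For readers unfamiliar with the cumsum idiom used in that workaround, the following base R sketch unpacks it step by step. The vector x follows the run structure described in the question; the names changed and run_id are only illustrative.

```r
# x as described in the question: runs of length 3, 2, 1, 1
x <- c(1, 1, 1, 2, 2, 1, 2)

# Compare each element with its successor: TRUE marks the start of a new run
changed <- head(x, -1) != tail(x, -1)
changed  # FALSE FALSE TRUE FALSE TRUE TRUE

# Prepend 1 for the first row, then accumulate the change flags
run_id <- cumsum(c(1, changed))
run_id   # 1 1 1 2 2 3 4
```

Because the leading 1 is a double, run_id is a double vector, which matches the (dbl) type of the run column in the output above.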

What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?

Answer 1:

One option seems to be the use of {} as in:

dat %>%
  group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
  summarize(mean(y))
# Source: local data frame [4 x 2]
#
#      yy mean(y)
#   (int)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0

It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
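In the meantime, rleid from data.table can be called directly inside group_by. The sketch below assumes both dplyr and data.table are installed; the data frame is rebuilt from the run structure in the question, with y = 1:7 chosen as an assumption because it reproduces the stated means. The column names run and mean_y are only illustrative. As far as I know, newer dplyr releases (1.1.0 and later) also offer consecutive_id(), which could be substituted for data.table::rleid here.

```r
library(dplyr)

# Example data; y = 1:7 is an assumption consistent with the means in the question
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# data.table::rleid() returns an integer run id for each row of x
res <- dat %>%
  group_by(run = data.table::rleid(x)) %>%
  summarize(mean_y = mean(y))

res
```

Calling rleid with the data.table:: prefix avoids attaching data.table, which would otherwise mask a few dplyr functions.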


I noticed that this problem occurs when using a data.frame or tbl_df input, but not when using a tbl_dt or data.table input:

dat %>%
  tbl_df %>%
  group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Error: cannot coerce type 'closure' to vector of type 'integer'

dat %>%
  tbl_dt %>%
  group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
  summarize(mean(y))
# Source: local data table [4 x 2]
#
#      yy mean(y)
#   (int)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0

I reported this as an issue on dplyr's GitHub page.



Answer 2:

If you explicitly create a grouping variable g before calling group_by, it more or less works:

dat %>%
  transform(g=with(rle(dat$x), { rep(seq_along(lengths), lengths) })) %>%
  group_by(g) %>%
  summarize(mean(y))
# Source: local data frame [4 x 2]
#
#       g mean(y)
#   (int)   (dbl)
# 1     1     2.0
# 2     2     4.5
# 3     3     6.0
# 4     4     7.0

I used transform here because mutate throws an error.


