In R, I want to summarize my data after grouping it based on the runs of a variable x
(i.e., each group corresponds to a subset of the data where consecutive values of x are the same).
One option seems to be the use of {} as in:
dat %>%
group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
summarize(mean(y))
#Source: local data frame [4 x 2]
#
# yy mean(y)
# (int) (dbl)
#1 1 2.0
#2 2 4.5
#3 3 6.0
#4 4 7.0
It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
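For reference, the run-id idea used above can be sketched in base R alone. This is only an illustration: the data frame below is hypothetical (chosen so the group means agree with the output shown), and run_id is a made-up helper name, not a function from dplyr or data.table.

```r
# Hypothetical data, chosen so the run means match the output above
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# run_id() is an illustrative helper, not part of any package:
# it numbers each run of identical consecutive values 1, 2, 3, ...
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

# Attach the run id as a column, then average y within each run
dat$run <- run_id(dat$x)
res <- aggregate(y ~ run, data = dat, FUN = mean)
print(res)
```

On this data, data.table's rleid(dat$x) should give the same ids, and dplyr 1.1.0 and later provide consecutive_id() for the same purpose.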
I noticed that this problem occurs when using a data.frame or tbl_df input, but not when using a tbl_dt or data.table input:
dat %>%
tbl_df %>%
group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
Error: cannot coerce type 'closure' to vector of type 'integer'
dat %>%
tbl_dt %>%
group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
summarize(mean(y))
Source: local data table [4 x 2]
yy mean(y)
(int) (dbl)
1 1 2.0
2 2 4.5
3 3 6.0
4 4 7.0
I reported this as an issue on dplyr's GitHub page.
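The word 'closure' in the error message hints that the bare name lengths may have been resolved to the base R function lengths() instead of the element of the rle result. That is only a guess, not something confirmed here. The sketch below, on hypothetical data, shows the name clash in plain base R and one way to compute the run id without using that name at all.

```r
# Hypothetical data, consistent with the output shown above
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# In a plain R session, the bare name `lengths` is a base function (R >= 3.2.0)
base_is_function <- is.function(lengths)

# Inside with(), `lengths` is looked up in the rle object first,
# so here it is the integer vector of run lengths
inside_is_function <- with(rle(dat$x), is.function(lengths))

# Alternative run id without rle(): start a new run wherever the value changes.
# Assumes x is non-empty and contains no NA values.
run_id2 <- cumsum(c(TRUE, dat$x[-1] != dat$x[-nrow(dat)]))

print(run_id2)
```

Since run_id2 never mentions lengths, it could be used inside group_by() without relying on how that name is looked up.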
If you explicitly create a grouping variable g, it more or less works:
dat %>%
transform(g = with(rle(dat$x), rep(seq_along(lengths), lengths))) %>%
group_by(g) %>%
summarize(mean(y))
Source: local data frame [4 x 2]
g mean(y)
(int) (dbl)
1 1 2.0
2 2 4.5
3 3 6.0
4 4 7.0
I used transform here because mutate throws an error.