I have the following data frame:
x <- read.table(text = "id1 id2 val1 val2
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8", header = TRUE)
Using the dplyr package you could achieve this with summarise_all(). This variant of summarise() applies the supplied functions (here mean and n()) to each of the non-grouping columns:
x %>%
  group_by(id1, id2) %>%
  summarise_all(funs(mean, n()))
which gives:
id1 id2 val1_mean val2_mean val1_n val2_n
1 a x 1.5 6.5 2 2
2 a y 3.5 7.0 2 2
3 b x 2.0 8.0 2 2
4 b y 3.0 6.0 2 2
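Note that funs() was deprecated in dplyr 0.8.0; with current dplyr you pass a named list of functions instead, which returns the same summaries (a minimal sketch, assuming dplyr >= 0.8):
x %>%
  group_by(id1, id2) %>%
  summarise_all(list(mean = mean, n = length))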
If you don't want to apply the function(s) to all non-grouping columns, use summarise_at(): either name the columns to include, or exclude the unwanted ones with a minus:
# inclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(val1, val2), funs(mean, n()))

# exclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(-val2), funs(mean, n()))
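The inclusion form reproduces the summarise_all() result above; the exclusion form drops val2, so only val1 is summarised (columns mean and n).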
Another dplyr option is across(), which at the time of writing is part of the current development version (it was released in dplyr 1.0.0):
# devtools::install_github("tidyverse/dplyr")
library(dplyr)

x %>%
  group_by(id1, id2) %>%
  summarise(across(starts_with("val"), list(mean = mean, n = length)))
Result:
# A tibble: 4 x 4
# Groups: id1 [2]
id1 id2 mean$val1 $val2 n$val1 $val2
<fct> <fct> <dbl> <dbl> <int> <int>
1 a x 1.5 6.5 2 2
2 a y 3.5 7 2 2
3 b x 2 8 2 2
4 b y 3 6 2 2
packageVersion("dplyr")
[1] ‘0.8.99.9000’
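On released dplyr (1.0.0+) the same call returns ordinary flat columns rather than the packed ones shown above, and the .names argument controls the naming pattern (a sketch assuming dplyr >= 1.0.0):
x %>%
  group_by(id1, id2) %>%
  summarise(across(starts_with("val"), list(mean = mean, n = length),
                   .names = "{col}_{fn}"))
# columns: id1, id2, val1_mean, val1_n, val2_mean, val2_n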
Given this in the question:
I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.
Then in data.table (1.9.4+) you could try:
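First build the data.table from the x defined above (a minimal setup sketch):
library(data.table)
DT <- as.data.table(x)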
> DT
id1 id2 val1 val2
1: a x 1 9
2: a x 2 4
3: a y 3 5
4: a y 4 9
5: b x 1 7
6: b y 4 4
7: b x 3 9
8: b y 2 8
> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)] # simplest
id1 id2 V1 V2 N
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)] # named
id1 id2 val1.m val2.m count
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)] # mean over all columns
id1 id2 val1 val2 count
1: a x 1.5 6.5 2
2: a y 3.5 7.0 2
3: b x 2.0 8.0 2
4: b y 3.0 6.0 2
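.SD covers all non-grouping columns by default; if you only want some of them, .SDcols restricts the set (a sketch keeping just val1):
DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2), .SDcols = "val1"]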
For timings comparing aggregate (used in the question and all three other answers) to data.table, see this benchmark (the agg and agg.x cases).
You can also use plyr::each() to supply multiple functions:
aggregate(cbind(val1, val2) ~ id1 + id2, data = x,
          FUN = plyr::each(avg = mean, n = length))
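On this data that should give output along these lines (aggregate stores the results of each() as matrix columns, printed with .avg/.n suffixes):
  id1 id2 val1.avg val1.n val2.avg val2.n
1   a   x      1.5    2.0      6.5    2.0
2   b   x      2.0    2.0      8.0    2.0
3   a   y      3.5    2.0      7.0    2.0
4   b   y      3.0    2.0      6.0    2.0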
Perhaps you want to merge?
x.mean <- aggregate(. ~ id1 + id2, x, mean)
x.len  <- aggregate(. ~ id1 + id2, x, length)
merge(x.mean, x.len, by = c("id1", "id2"))
id1 id2 val1.x val2.x val1.y val2.y
1 a x 1.5 6.5 2 2
2 a y 3.5 7.0 2 2
3 b x 2.0 8.0 2 2
4 b y 3.0 6.0 2 2
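The .x/.y suffixes are merge()'s defaults; its suffixes argument gives more descriptive names (a sketch using base R's merge):
merge(x.mean, x.len, by = c("id1", "id2"), suffixes = c(".mean", ".n"))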
You could add a count column, aggregate with sum, then scale back to get the mean:
x$count <- 1
agg <- aggregate(. ~ id1 + id2, data = x, FUN = sum)
agg
# id1 id2 val1 val2 count
# 1 a x 3 13 2
# 2 b x 4 16 2
# 3 a y 7 14 2
# 4 b y 6 12 2
agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg
# id1 id2 val1 val2 count
# 1 a x 1.5 6.5 2
# 2 b x 2.0 8.0 2
# 3 a y 3.5 7.0 2
# 4 b y 3.0 6.0 2
This has the advantage of preserving your column names and producing a single count column.
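The trick works because the mean decomposes as sum/count; once rescaled, the helper column can be dropped (a trivial cleanup sketch):
agg$count <- NULL
agg
#   id1 id2 val1 val2
# 1   a   x  1.5  6.5
# 2   b   x  2.0  8.0
# 3   a   y  3.5  7.0
# 4   b   y  3.0  6.0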