Apply several summary functions on several variables by group in one call

一个人的身影 2020-11-22 00:03

I have the following data frame:

x <- read.table(text = "  id1 id2 val1 val2
1   a   x    1    9
2   a   x    2    4
3   a   y    3    5
4   a   y    4    9
5   b   x    1    7
6   b   y    4    4
7   b   x    3    9
8   b   y    2    8", header = TRUE)

  • 2020-11-22 00:05

    With the dplyr package you can do this with summarise_all(), which applies the given functions (here mean and n()) to every non-grouping column:

    x %>%
      group_by(id1, id2) %>%
      summarise_all(funs(mean, n()))

    which gives:

         id1    id2 val1_mean val2_mean val1_n val2_n
    1      a      x       1.5       6.5      2      2
    2      a      y       3.5       7.0      2      2
    3      b      x       2.0       8.0      2      2
    4      b      y       3.0       6.0      2      2

    If you don't want to apply the functions to every non-grouping column, use summarise_at() and either name the columns to include or exclude the unwanted ones with a minus sign:

    # inclusion
    x %>%
      group_by(id1, id2) %>%
      summarise_at(vars(val1, val2), funs(mean, n()))
    # exclusion
    x %>%
      group_by(id1, id2) %>%
      summarise_at(vars(-val2), funs(mean, n()))
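
    Since dplyr 0.8.0, funs() is deprecated, and a named list of functions is the usual replacement. A minimal sketch, assuming dplyr >= 0.8.0 is installed; the data frame is rebuilt here so the block runs on its own, and length stands in for n(), which takes no argument:

    ```r
    library(dplyr)

    # The question's data, rebuilt so the example is self-contained
    x <- data.frame(
      id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
      id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
      val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
      val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
    )

    # A named list replaces funs(); the list names become column suffixes
    res <- x %>%
      group_by(id1, id2) %>%
      summarise_all(list(mean = mean, n = length)) %>%
      ungroup()

    # columns: id1, id2, val1_mean, val2_mean, val1_n, val2_n
    print(res)
    ```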
  • 2020-11-22 00:06

    Another dplyr option is across(), which at the time of writing is part of the current development version:

    x %>% 
      group_by(id1, id2) %>% 
      summarise(across(starts_with("val"), list(mean = mean, n = length)))


    # A tibble: 4 x 4
    # Groups:   id1 [2]
      id1   id2   mean$val1 $val2 n$val1 $val2
      <fct> <fct>     <dbl> <dbl>  <int> <int>
    1 a     x           1.5   6.5      2     2
    2 a     y           3.5   7        2     2
    3 b     x           2     8        2     2
    4 b     y           3     6        2     2
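
    In released dplyr (1.0.0 and later), the same call returns ordinary flat columns rather than the nested header printed above. A sketch under that assumption, with the question's data rebuilt inline; .groups = "drop" is an optional extra that returns an ungrouped tibble:

    ```r
    library(dplyr)

    # The question's data, rebuilt so the example is self-contained
    x <- data.frame(
      id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
      id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
      val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
      val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
    )

    res <- x %>%
      group_by(id1, id2) %>%
      summarise(
        # one output column per (input column, function) pair
        across(starts_with("val"), list(mean = mean, n = length)),
        .groups = "drop"
      )

    # default names: val1_mean, val1_n, val2_mean, val2_n
    print(res)
    ```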

  • 2020-11-22 00:12

    Given this statement in the question:

    "I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows."

    with data.table (1.9.4+) you could try:

    > DT
       id1 id2 val1 val2
    1:   a   x    1    9
    2:   a   x    2    4
    3:   a   y    3    5
    4:   a   y    4    9
    5:   b   x    1    7
    6:   b   y    4    4
    7:   b   x    3    9
    8:   b   y    2    8
    > DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)]   # simplest
       id1 id2  V1  V2 N
    1:   a   x 1.5 6.5 2
    2:   a   y 3.5 7.0 2
    3:   b   x 2.0 8.0 2
    4:   b   y 3.0 6.0 2
    > DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)]  # named
       id1 id2 val1.m val2.m count
    1:   a   x    1.5    6.5     2
    2:   a   y    3.5    7.0     2
    3:   b   x    2.0    8.0     2
    4:   b   y    3.0    6.0     2
    > DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)]   # mean over all columns
       id1 id2 val1 val2 count
    1:   a   x  1.5  6.5     2
    2:   a   y  3.5  7.0     2
    3:   b   x  2.0  8.0     2
    4:   b   y  3.0  6.0     2
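
    The last call above applies a single function to every column. To apply several functions per column without naming each result by hand, one possible idiom is to return a nested list per column and flatten it one level. A sketch, assuming data.table is installed; the .SDcols selection and the mean/n labels are illustrative choices:

    ```r
    library(data.table)

    # The same data as DT above, rebuilt so the example is self-contained
    DT <- data.table(
      id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
      id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
      val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
      val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
    )

    # For each column in .SD build list(mean, n), then flatten one level so
    # the list names become val1.mean, val1.n, val2.mean, val2.n
    res <- DT[, unlist(lapply(.SD, function(v) list(mean = mean(v), n = length(v))),
                       recursive = FALSE),
              by = .(id1, id2), .SDcols = c("val1", "val2")]

    print(res)
    ```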

    For timings comparing aggregate (used in the question and in all 3 other answers) with data.table, see this benchmark (the agg and agg.x cases).

  • 2020-11-22 00:16

    You can also use plyr::each() to apply multiple functions in one aggregate() call:

    aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = plyr::each(avg = mean, n = length))
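
    When FUN returns more than one value, aggregate() stores the results as one matrix column per variable, so the data frame has fewer real columns than it appears to when printed. A base-R sketch (an anonymous function stands in for plyr::each(), so plyr is not needed) that flattens the result, with the question's data rebuilt inline:

    ```r
    # The question's data, rebuilt so the example is self-contained
    x <- data.frame(
      id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
      id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
      val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
      val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
    )

    # FUN returns a named vector, so val1 and val2 come back as matrices
    agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x,
                     FUN = function(v) c(avg = mean(v), n = length(v)))
    ncol(agg)  # 4: id1, id2, and two matrix columns

    # Expand the matrix columns into ordinary columns
    flat <- do.call(data.frame, agg)
    names(flat)  # id1, id2, val1.avg, val1.n, val2.avg, val2.n
    ```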
  • 2020-11-22 00:23

    Perhaps you want to merge?

    # x is the data frame from the question
    x.mean <- aggregate(. ~ id1 + id2, data = x, FUN = mean)
    x.len  <- aggregate(. ~ id1 + id2, data = x, FUN = length)
    merge(x.mean, x.len, by = c("id1", "id2"))
      id1 id2 val1.x val2.x val1.y val2.y
    1   a   x    1.5    6.5      2      2
    2   a   y    3.5    7.0      2      2
    3   b   x    2.0    8.0      2      2
    4   b   y    3.0    6.0      2      2
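
    The default .x and .y suffixes do not say which statistic each column holds. merge() takes a suffixes argument for this; a short sketch with the question's data rebuilt inline (the suffix strings are arbitrary choices):

    ```r
    # The question's data, rebuilt so the example is self-contained
    x <- data.frame(
      id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
      id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
      val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
      val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
    )

    x.mean <- aggregate(. ~ id1 + id2, data = x, FUN = mean)
    x.len  <- aggregate(. ~ id1 + id2, data = x, FUN = length)

    # suffixes label the clashing val1/val2 columns from each input
    res <- merge(x.mean, x.len, by = c("id1", "id2"),
                 suffixes = c(".mean", ".n"))
    names(res)  # id1, id2, val1.mean, val2.mean, val1.n, val2.n
    ```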
  • 2020-11-22 00:26

    You could add a count column, aggregate with sum, then scale back to get the mean:

    x$count <- 1
    agg <- aggregate(. ~ id1 + id2, data = x, FUN = sum)
    #   id1 id2 val1 val2 count
    # 1   a   x    3   13     2
    # 2   b   x    4   16     2
    # 3   a   y    7   14     2
    # 4   b   y    6   12     2
    agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
    #   id1 id2 val1 val2 count
    # 1   a   x  1.5  6.5     2
    # 2   b   x  2.0  8.0     2
    # 3   a   y  3.5  7.0     2
    # 4   b   y  3.0  6.0     2

    It has the advantage of preserving your column names and creating a single count column.
