Group by multiple columns and sum other multiple columns

孤城傲影 2020-11-22 07:35

I have a data frame with about 200 columns. I want to group the table by the first 10 or so columns, which are factors, and sum the rest of the columns.

I have list of

7 Answers
  • 2020-11-22 08:07

    This seems like a task for ddply (I use the 'baseball' dataset which is included with plyr):

    library(plyr)
    groupColumns <- c("year", "team")      # columns to group by
    dataColumns <- c("hr", "rbi", "sb")    # columns to sum within each group
    res <- ddply(baseball, groupColumns, function(x) colSums(x[dataColumns]))
    head(res)
    

    This gives, for each combination of values in groupColumns, the sum of the columns specified in dataColumns.

  • 2020-11-22 08:16

    Another way to do this with dplyr, which is generic (no list of columns is needed), would be:

    df %>% group_by_if(is.factor) %>% summarize_if(is.numeric, sum, na.rm = TRUE)
    
  • 2020-11-22 08:19

    Using plyr::ddply:

    library(plyr)
    ddply(dtfr, .(name1, name2, namex), numcolwise(sum))
    
  • 2020-11-22 08:23

    The dplyr way would be (with dplyr 1.0 or later, where across() replaces the deprecated summarise_each() and funs()):

    library(dplyr)
    df %>%
      group_by(col1, col2, col3) %>%
      summarise(across(everything(), sum))
    

    You can further restrict or exclude the columns to be summarised inside across() by using the selection helpers described in the help file ?dplyr::select.

  • 2020-11-22 08:24

    In base R this would be...

    aggregate( as.matrix(df[,11:200]), as.list(df[,1:10]), FUN = sum)
    

    EDIT: The aggregate function has come a long way since I wrote this. None of the casting above is necessary.

    aggregate( df[,11:200], df[,1:10], FUN = sum )
    

    There are a variety of ways to write this. Assuming the first 10 columns are named a1 through a10, I like the following, even though it is verbose.

    aggregate(. ~ a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10, data = df, FUN = sum)
    

    (You could also build the formula as a string with paste and convert it with as.formula.)
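
    The formula approach above can be built programmatically rather than typed out. A minimal sketch, assuming a small made-up data frame: `toy`, its columns `a1`, `a2`, `x`, `y`, and the helper name `group_cols` are all hypothetical and not from the question. It shows the paste route and, as an alternative, base R's `reformulate()`, which should produce the same formula:

    ```r
    # Hypothetical toy data: two grouping factors and two numeric columns.
    toy <- data.frame(
      a1 = factor(c("p", "p", "q", "q")),
      a2 = factor(c("u", "u", "u", "v")),
      x  = 1:4,
      y  = c(10, 20, 30, 40)
    )

    # Names of the grouping columns (names(df)[1:10] in the question's setting).
    group_cols <- names(toy)[1:2]

    # Option 1: paste the right-hand side together, then convert to a formula.
    f1 <- as.formula(paste(". ~", paste(group_cols, collapse = " + ")))

    # Option 2: reformulate() builds the formula from the term labels directly.
    f2 <- reformulate(group_cols, response = ".")

    # Both formulas sum every non-grouping column within each group.
    res1 <- aggregate(f1, data = toy, FUN = sum)
    res2 <- aggregate(f2, data = toy, FUN = sum)
    print(res1)
    ```

    Note that the formula method of aggregate uses na.action = na.omit by default, so rows containing any NA are dropped before summing; whether that is acceptable depends on the data.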

  • 2020-11-22 08:29

    Let's consider this example:

    df <- data.frame(a = 'a', b = c('a', 'a', 'b', 'b', 'b'), c = 1:5, d = 11:15,
                     stringsAsFactors = TRUE)
    

    The _all, _at and _if verbs are now superseded by across(). To group by all the factor columns and sum all the other columns, we can do:

    library(dplyr)
    
    df %>% 
       group_by(across(where(is.factor))) %>% 
       summarise(across(everything(), sum))
    
    #  a     b         c     d
    #  <fct> <fct> <int> <int>
    #1 a     a         3    23
    #2 a     b        12    42
    

    To group by all factor columns and sum only the numeric columns:

    df %>% 
      group_by(across(where(is.factor))) %>% 
      summarise(across(where(is.numeric), sum))
    

    We can also do this by position, but we have to be careful with the numbers: inside summarise(), across() does not count the grouping columns, so the second 1:2 below refers to c and d.

    df %>% group_by(across(1:2)) %>% summarise(across(1:2, sum))
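
    For the case in the question (group by the first 10 columns, sum the rest), selecting the grouping columns by name sidesteps the position pitfall. A sketch, assuming dplyr 1.0 or later and reusing the example df from this answer, with its first two columns standing in for the first ten; `group_cols` is a name introduced here for illustration only:

    ```r
    library(dplyr)

    # Same example data as above: a and b are factors, c and d are integers.
    df <- data.frame(a = 'a', b = c('a', 'a', 'b', 'b', 'b'), c = 1:5, d = 11:15,
                     stringsAsFactors = TRUE)

    # Grouping columns taken by position once, then used by name.
    # For the 200-column case this would be names(df)[1:10].
    group_cols <- names(df)[1:2]

    res <- df %>%
      group_by(across(all_of(group_cols))) %>%
      # everything() here covers only the non-grouping columns (c and d).
      summarise(across(everything(), sum), .groups = "drop")

    res
    ```

    Passing .groups = "drop" returns an ungrouped tibble, which avoids the message summarise() otherwise prints about the remaining grouping.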
    