R- Collapse rows and sum the values in the column

情话喂你 2021-02-06 02:46

I have the following dataframe (df1):

ID    someText    PSM OtherValues
ABC   c   2   qwe
CCC   v   3   wer
DDD   b   56  ert
EEE   m   78  yu
FFF           


        
5 answers
  •  梦毁少年i
    2021-02-06 03:37

    Using the aggregate function seems to work better than dplyr if you just want to keep the original column names and operate on one column at a time. It also avoids the summarize function.

    Note from the summarize function documentation:

    Be careful when using existing variable names; the corresponding columns will be immediately updated with the new data and this can affect subsequent operations referring to those variables.

    For instance

    ## modified example from aggregate documentation with character variables and NAs
    testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
    by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
    
    aggregate(x = testDF, by = list(by1), FUN = "sum")
    Group.1 v1  v2
    1       1 15 165
    2      12  9  99
    3       2 NA  NA
    4     big  3  33
    5    blue  3  33
    6     red  5  55
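
    In the output above, group 2 sums to NA because one of its rows holds NA in both v1 and v2. As a minimal sketch, assuming you would rather ignore missing values: aggregate forwards extra arguments to FUN, so na.rm = TRUE can be passed through to sum. The group column name ID is chosen here only for illustration.

```r
# Same data as in the example above
testDF <- data.frame(v1 = c(1, 3, 5, 7, 8, 3, 5, NA, 4, 5, 7, 9),
                     v2 = c(11, 33, 55, 77, 88, 33, 55, NA, 44, 55, 77, 99))
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

# Arguments after FUN are handed on to FUN, so this calls sum(..., na.rm = TRUE)
res <- aggregate(x = testDF, by = list(ID = by1), FUN = sum, na.rm = TRUE)
print(res)
```

    Rows whose grouping value is NA are still dropped by aggregate, so the result keeps six groups; only the NA inside group 2 is skipped.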
    

    You get what you want, but when you use summarize with ddply you need to specify the output names. So if you have many columns, aggregate seems more convenient.

    library(plyr)  # ddply() and summarize() come from plyr
    testDF$ID <- by1
    ddply(testDF, .(ID), summarize, v1 = sum(v1), v2 = sum(v2))
    ID v1  v2
    1    1 15 165
    2   12  9  99
    3    2 NA  NA
    4  big  3  33
    5 blue  3  33
    6  red  5  55
    7 <NA> 15 165
    

    To see the effect of summarize updating the columns immediately, compare the following examples:

    ddply(testDF, .(ID), summarize, v1=max(v1,v2), v2=min(v1,v2) )
    ID v1 v2
    1    1 55 55
    2   12 99 99
    3    2 NA NA
    4  big 33 33
    5 blue 33 33
    6  red 44 11
    7 <NA> 88 77
    
    ddply(testDF, .(ID), summarize, v1=min(v1,v2), v2=min(v1,v2) )
    ID v1 v2
    1    1  5  5
    2   12  9  9
    3    2 NA NA
    4  big  3  3
    5 blue  3  3
    6  red  1  1
    7 <NA>  7  7
    

    Note that in the first call v1 is computed with max, and the column is already updated by the time v2 is calculated. For ID = 1, for instance, min(v1, v2) therefore returns 55 rather than 5, because it sees the new v1 instead of the original values.
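
    The mechanism can be reproduced in base R without plyr. This is only an illustrative sketch: the vectors hold the ID = 1 rows of testDF, and the names v1_new, v2_seq and v2_orig are made up for the example.

```r
# Rows of testDF that belong to ID "1"
v1 <- c(5, 5, 5)
v2 <- c(55, 55, 55)

# Sequential evaluation, mimicking summarize: v1 is replaced first ...
v1_new <- max(v1, v2)
# ... so the second expression sees the updated v1, not the original column
v2_seq <- min(v1_new, v2)

# Evaluating against the untouched original columns instead
v2_orig <- min(v1, v2)

print(c(sequential = v2_seq, original = v2_orig))
```

    Giving the summary columns names that differ from the input columns is one way to avoid the overwrite.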
