Aggregate data frame while keeping original order, in a simple manner

囚心锁ツ 2021-02-15 12:38

I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in data frame). I've managed to get it right

4 answers
  • 2021-02-15 12:52

    While looking for solutions to the same problem, I found another one that uses aggregate(), after first converting the selection variables to factors whose levels are in the order you want.

    # Columns to sum: everything except the selection variables
    all.add <- names(orig.df)[!(names(orig.df) %in% c("sel.1", "sel.2"))]
    
    # Selection variables as factors with levels in the order you want
    orig.df$sel.1 <- factor(orig.df$sel.1, levels = unique(orig.df$sel.1))
    orig.df$sel.2 <- factor(orig.df$sel.2, levels = unique(orig.df$sel.2))
    
    # This is ordered first by sel.1, then by sel.2
    aggr.df.ordered <- aggregate(orig.df[,all.add], 
                                 by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)
    

    The output is:

       newvar add.1 add.2
    1     1 1   100    91
    2     1 4   170   183
    3     1 5   384   366
    4     2 2   175   176
    5     2 3    90    96
    6     2 4    82    87
    7     2 5    95    89
    8     3 2   189   178
    9     3 3    81    82
    10    4 1   174   192
    11    5 3    91    98
    12    5 4    96    84
    13    5 5    83    88
    

    To order the result by the first appearance of each combination of both variables, you need a new variable:

    # ordered by first appearance of the two variables (needs a new variable)
    orig.df$newvar <- paste(orig.df$sel.1, orig.df$sel.2)
    orig.df$newvar <- factor(orig.df$newvar, levels = unique(orig.df$newvar))
    
    aggr.df.ordered2 <- aggregate(orig.df[,all.add], 
                                  by=list(newvar = orig.df$newvar,
                                          sel.1 = orig.df$sel.1, 
                                          sel.2 = orig.df$sel.2), sum)
    

    which gives the output:

       newvar sel.2 sel.1 add.1 add.2
    1     5 4     4     5    96    84
    2     5 5     5     5    83    88
    3     5 3     3     5    91    98
    4     2 4     4     2    82    87
    5     2 2     2     2   175   176
    6     2 5     5     2    95    89
    7     2 3     3     2    90    96
    8     1 4     4     1   170   183
    9     1 5     5     1   384   366
    10    1 1     1     1   100    91
    11    4 1     1     4   174   192
    12    3 2     2     3   189   178
    13    3 3     3     3    81    82
    

    With this solution you do not need to install any new package.
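
    One caveat worth illustrating: aggregate() sorts its result by the levels of every factor passed in `by`, so putting sel.1 and sel.2 back into `by` can reorder the rows. A minimal sketch, assuming strict first-appearance order of each pair is wanted, is to group on the combined key alone and recover the selection columns afterwards. The data and the names `key`, `sums`, `first` and `result` are made up for illustration and are not the original poster's orig.df.

    ```r
    # Hypothetical sample data (not the original orig.df)
    orig.df <- data.frame(
      sel.1 = c(5, 2, 5, 1, 2, 5),
      sel.2 = c(4, 2, 3, 5, 4, 4),
      add.1 = c(1, 2, 3, 4, 5, 6),
      add.2 = c(10, 20, 30, 40, 50, 60)
    )

    # Combined key whose factor levels follow first appearance
    key <- paste(orig.df$sel.1, orig.df$sel.2)
    key <- factor(key, levels = unique(key))

    # Group on the combined key only, so only its level order matters
    sums <- aggregate(orig.df[c("add.1", "add.2")], by = list(key = key), FUN = sum)

    # Recover sel.1 and sel.2 from the first row of each group
    first <- !duplicated(key)
    result <- cbind(orig.df[first, c("sel.1", "sel.2")], sums[c("add.1", "add.2")])
    rownames(result) <- NULL
    result
    ```

    This still uses base R only; the extra step is taking the selection columns from the first row of each group instead of listing them in `by`.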

  • 2021-02-15 12:55

    It's short and simple in data.table. It returns the groups in first appearance order by default.

    require(data.table)
    DT = as.data.table(orig.df)
    DT[, list(sum(add.1),sum(add.2)), by=list(sel.1,sel.2)]
    
        sel.1 sel.2  V1  V2
     1:     5     4  96  84
     2:     2     2 175 176
     3:     1     5 384 366
     4:     2     5  95  89
     5:     4     1 174 192
     6:     2     4  82  87
     7:     5     3  91  98
     8:     3     2 189 178
     9:     1     4 170 183
    10:     1     1 100  91
    11:     3     3  81  82
    12:     5     5  83  88
    13:     2     3  90  96
    

    This is also fast on large data, so you will not need to rewrite your code later if speed becomes an issue. The following alternative syntax is the easiest way to pass in the grouping columns by name.

    DT[, lapply(.SD,sum), by=c("sel.1","sel.2")]
    
        sel.1 sel.2 add.1 add.2
     1:     5     4    96    84
     2:     2     2   175   176
     3:     1     5   384   366
     4:     2     5    95    89
     5:     4     1   174   192
     6:     2     4    82    87
     7:     5     3    91    98
     8:     3     2   189   178
     9:     1     4   170   183
    10:     1     1   100    91
    11:     3     3    81    82
    12:     5     5    83    88
    13:     2     3    90    96
    

    Alternatively, by may be a single comma-separated string of column names:

    DT[, lapply(.SD,sum), by="sel.1,sel.2"]
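
    To make the ordering behaviour concrete, here is a small sketch on made-up data (not the original orig.df) that contrasts `by`, which keeps groups in first-appearance order, with `keyby`, which sorts the result by the grouping columns. It also names the summed columns explicitly so they do not come back as V1 and V2. The object names `first_seen` and `sorted` are only illustrative, and the data.table package is assumed to be installed.

    ```r
    library(data.table)

    # Hypothetical sample data
    DT <- data.table(
      sel.1 = c(5, 2, 5, 1),
      sel.2 = c(4, 2, 4, 5),
      add.1 = c(1, 2, 3, 4),
      add.2 = c(10, 20, 30, 40)
    )

    # by: groups come back in the order they first appear
    first_seen <- DT[, .(add.1 = sum(add.1), add.2 = sum(add.2)), by = .(sel.1, sel.2)]

    # keyby: same sums, but rows are sorted by sel.1, then sel.2
    sorted <- DT[, lapply(.SD, sum), keyby = .(sel.1, sel.2)]

    first_seen
    sorted
    ```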
    
  • 2021-02-15 13:01

    A bit tough to read, but it gives you what you want and I added some comments to clarify.

    # Define the columns you want to combine into the grouping variable
    sel.col <- grepl("^sel", names(orig.df))
    # Create the grouping variable
    lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
    # Split and sum up
    data.frame(unique(orig.df[sel.col]),
               t(sapply(split(orig.df[!sel.col], factor(lev, levels=unique(lev))),
                        apply, 2, sum)))
    

    The output looks like this

       sel.1 sel.2 add.1 add.2
    1      5     4    96    84
    2      2     2   175   176
    3      1     5   384   366
    5      2     5    95    89
    6      4     1   174   192
    7      2     4    82    87
    8      5     3    91    98
    10     3     2   189   178
    11     1     4   170   183
    14     1     1   100    91
    17     3     3    81    82
    19     5     5    83    88
    20     2     3    90    96
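
    If the nested sapply/split call is hard to follow, a possibly more readable sketch of the same idea uses base R's rowsum() with reorder = FALSE, which returns the groups in the order they are first encountered. The sample data below is invented for illustration; `sums` and `result` are placeholder names.

    ```r
    # Hypothetical sample data (not the original orig.df)
    orig.df <- data.frame(
      sel.1 = c(5, 2, 5, 1, 2),
      sel.2 = c(4, 2, 4, 5, 2),
      add.1 = c(10, 20, 30, 40, 50),
      add.2 = c(1, 2, 3, 4, 5)
    )

    # Columns that define the groups
    sel.col <- grepl("^sel", names(orig.df))
    # One combined key string per row
    lev <- do.call(paste, orig.df[sel.col])

    # Column sums per group, kept in first-appearance order
    sums <- rowsum(orig.df[!sel.col], group = lev, reorder = FALSE)

    # unique() also keeps first-appearance order, so the rows line up
    result <- cbind(unique(orig.df[sel.col]), sums)
    rownames(result) <- NULL
    result
    ```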
    
  • 2021-02-15 13:11

    I'm not sure how this solution performs in terms of speed and memory use on large datasets, but I thought it was a pretty easy way to solve this problem.

    # Create dataframe
    x <- c("C", "C", "A", "A", "A","B", "B")
    y <- c(5, 6, 3, 2, 7, 8, 9)
    df <- data.frame(x, y)
    df
    

    Original dataframe:

      x y
    1 C 5
    2 C 6
    3 A 3
    4 A 2
    5 A 7
    6 B 8
    7 B 9
    

    Solution:

    # Add column with the original order
    df$order <- seq_len(nrow(df))
    
    # Aggregate
    # use sum for column y (the variable you want to aggregate by x)
    df1 <- aggregate(y~x,data=df,FUN=sum)
    # use mean for column 'order'
    df2 <- aggregate(order~x, data=df,FUN=mean)
    
    # Add the mean of order values to the dataframe
    df <- df1
    df$order <- df2$order
    
    # Order the dataframe according to the mean of order values
    df <- df[order(df$order),]
    df
    

    Aggregated dataframe with same order:

      x  y order
    3 C 11   1.5
    1 A 12   4.0
    2 B 17   6.5
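
    A shorter variant, sketched on the same toy data: aggregate first, then put the rows back in first-appearance order with match() against unique(df$x). This also avoids a corner case of the mean-of-positions approach. When groups are interleaved (for example A in rows 1, 4, 5, 6 and B in rows 2, 3), the group that appears first can end up with the larger mean position. The name `agg` is just a placeholder.

    ```r
    # Same toy data as above
    df <- data.frame(
      x = c("C", "C", "A", "A", "A", "B", "B"),
      y = c(5, 6, 3, 2, 7, 8, 9)
    )

    # aggregate() returns the groups sorted (A, B, C)
    agg <- aggregate(y ~ x, data = df, FUN = sum)

    # Reorder the rows to follow the first appearance of each group in df
    agg <- agg[match(unique(df$x), agg$x), ]
    rownames(agg) <- NULL
    agg
    ```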
    