Aggregate data frame while keeping original order, in a simple manner

I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in the data frame). I've managed to get it right

4 Answers

广开言路

2021-02-15 12:52

While looking for solutions to the same problem, I found another one that uses aggregate(), after first converting the selection variables to factors whose levels are in the order you want.

all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]

# Selection variables as factors with levels in the order you want
orig.df$sel.1 <- factor(orig.df$sel.1, levels = unique(orig.df$sel.1))
orig.df$sel.2 <- factor(orig.df$sel.2, levels = unique(orig.df$sel.2))

# This is ordered first by sel.1, then by sel.2
aggr.df.ordered <- aggregate(orig.df[,all.add], 
                             by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

The output is:

   newvar add.1 add.2
1     1 1   100    91
2     1 4   170   183
3     1 5   384   366
4     2 2   175   176
5     2 3    90    96
6     2 4    82    87
7     2 5    95    89
8     3 2   189   178
9     3 3    81    82
10    4 1   174   192
11    5 3    91    98
12    5 4    96    84
13    5 5    83    88

To have it ordered for the first appearance of each combination of both variables, you need a new variable:

# ordered by first appearance of the two variables (needs a new variable)
orig.df$newvar <- paste(orig.df$sel.1, orig.df$sel.2)
orig.df$newvar <- factor(orig.df$newvar, levels = unique(orig.df$newvar))

aggr.df.ordered2 <- aggregate(orig.df[,all.add], 
                              by=list(newvar = orig.df$newvar,
                                      sel.2 = orig.df$sel.2, 
                                      sel.1 = orig.df$sel.1), sum)

which gives the output:

   newvar sel.2 sel.1 add.1 add.2
1     5 4     4     5    96    84
2     5 5     5     5    83    88
3     5 3     3     5    91    98
4     2 4     4     2    82    87
5     2 2     2     2   175   176
6     2 5     5     2    95    89
7     2 3     3     2    90    96
8     1 4     4     1   170   183
9     1 5     5     1   384   366
10    1 1     1     1   100    91
11    4 1     1     4   174   192
12    3 2     2     3   189   178
13    3 3     3     3    81    82

With this solution you do not need to install any new package.
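
One caveat worth checking: aggregate() arranges its result by the levels of the grouping factors, with the first entry of `by` varying fastest, so passing sel.1 and sel.2 alongside the combined key changes the row order. If strict first-appearance order of each combination is wanted, a minimal sketch is to aggregate on the combined key alone and recover the grouping columns afterwards. The `toy` data frame and the names `key`, `sums`, `first.rows` and `result` are made up for illustration and are not part of the original answer.

```r
# Toy data standing in for orig.df (values invented for illustration)
toy <- data.frame(sel.1 = c(5, 2, 5, 1, 2),
                  sel.2 = c(4, 2, 4, 5, 2),
                  add.1 = c(1, 2, 3, 4, 5),
                  add.2 = c(10, 20, 30, 40, 50))

# Combined key whose factor levels follow first appearance
key <- paste(toy$sel.1, toy$sel.2)
key <- factor(key, levels = unique(key))

# Aggregate on the key alone, so rows come back in level order
sums <- aggregate(toy[, c("add.1", "add.2")], by = list(key = key), FUN = sum)

# Recover the grouping columns from the first row of each combination
first.rows <- toy[!duplicated(key), c("sel.1", "sel.2")]
result <- cbind(first.rows, sums[, c("add.1", "add.2")])
rownames(result) <- NULL
result
```

This still needs no extra package; the only change is that the key is the sole grouping variable.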

青春惊慌失措

2021-02-15 12:55

It's short and simple in data.table. It returns the groups in first appearance order by default.

library(data.table)
DT <- as.data.table(orig.df)
# Sum both columns within each (sel.1, sel.2) group
DT[, list(sum(add.1), sum(add.2)), by = list(sel.1, sel.2)]

    sel.1 sel.2  V1  V2
 1:     5     4  96  84
 2:     2     2 175 176
 3:     1     5 384 366
 4:     2     5  95  89
 5:     4     1 174 192
 6:     2     4  82  87
 7:     5     3  91  98
 8:     3     2 189 178
 9:     1     4 170 183
10:     1     1 100  91
11:     3     3  81  82
12:     5     5  83  88
13:     2     3  90  96

This is also fast on large data, so there is no need to rewrite your code later if speed becomes an issue. The following alternative syntax is the easiest way to pass the grouping columns by name.

DT[, lapply(.SD,sum), by=c("sel.1","sel.2")]

    sel.1 sel.2 add.1 add.2
 1:     5     4    96    84
 2:     2     2   175   176
 3:     1     5   384   366
 4:     2     5    95    89
 5:     4     1   174   192
 6:     2     4    82    87
 7:     5     3    91    98
 8:     3     2   189   178
 9:     1     4   170   183
10:     1     1   100    91
11:     3     3    81    82
12:     5     5    83    88
13:     2     3    90    96

Alternatively, by may be a single comma-separated string of column names:

DT[, lapply(.SD,sum), by="sel.1,sel.2"]

庸人自扰

2021-02-15 13:01

A bit tough to read, but it gives you what you want and I added some comments to clarify.

# Define the columns you want to combine into the grouping variable
sel.col <- grepl("^sel", names(orig.df))
# Create the grouping variable
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
# Split and sum up
data.frame(unique(orig.df[sel.col]),
           t(sapply(split(orig.df[!sel.col], factor(lev, levels=unique(lev))),
                    apply, 2, sum)))

The output looks like this:

   sel.1 sel.2 add.1 add.2
1      5     4    96    84
2      2     2   175   176
3      1     5   384   366
5      2     5    95    89
6      4     1   174   192
7      2     4    82    87
8      5     3    91    98
10     3     2   189   178
11     1     4   170   183
14     1     1   100    91
17     3     3    81    82
19     5     5    83    88
20     2     3    90    96
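
If the split/sapply step is hard to follow, a possibly more readable base R variant uses rowsum(), whose `reorder = FALSE` argument keeps the groups in order of first appearance. This is only a sketch on made-up data; `toy`, `key`, `sums` and `result` are illustrative names, not part of the answer above.

```r
# Toy data standing in for orig.df (values invented for illustration)
toy <- data.frame(sel.1 = c(5, 2, 5, 1, 2),
                  sel.2 = c(4, 2, 4, 5, 2),
                  add.1 = c(1, 2, 3, 4, 5),
                  add.2 = c(10, 20, 30, 40, 50))

# One key per row, built from the grouping columns
key <- paste(toy$sel.1, toy$sel.2)

# Column sums per group; reorder = FALSE keeps first-appearance order
sums <- rowsum(toy[, c("add.1", "add.2")], group = key, reorder = FALSE)

# unique() also returns the combinations in first-appearance order,
# so the two pieces line up row by row
result <- cbind(unique(toy[, c("sel.1", "sel.2")]), sums)
rownames(result) <- NULL
result
```

Note that rowsum() only sums, so this does not generalise to other summary functions the way split() does.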

甜味超标

2021-02-15 13:11

I'm not sure how this solution performs in terms of speed and memory on large datasets, but I found it a fairly easy way to solve the problem.

# Create dataframe
x <- c("C", "C", "A", "A", "A","B", "B")
y <- c(5, 6, 3, 2, 7, 8, 9)
df <- data.frame(x, y)
df

Original dataframe:

  x y
1 C 5
2 C 6
3 A 3
4 A 2
5 A 7
6 B 8
7 B 9

Solution:

# Add a column recording the original row order
df$order <- seq_along(df$x)

# Aggregate
# use sum for column y (the variable you want to aggregate according to x)
df1 <- aggregate(y~x,data=df,FUN=sum)
# use mean for column 'order'
df2 <- aggregate(order~x, data=df,FUN=mean)

# Add the mean of order values to the dataframe
df <- df1
df$order <- df2$order

# Order the dataframe according to the mean of the order values
df <- df[order(df$order),]
df

Aggregated dataframe with same order:

  x  y order
3 C 11   1.5
1 A 12   4.0
2 B 17   6.5
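
One thing to watch with the mean of the positions: it matches first-appearance order when each group's rows are contiguous, as in the example, but it can misorder groups whose rows interleave. A hedged variant is to take the minimum position instead, which is exactly the row of first appearance. The data frame `d` and the names `pos`, `sums`, `first` and `out` are made up for illustration.

```r
# Toy data where the groups interleave (values invented for illustration)
d <- data.frame(x = c("B", "A", "A", "B", "B"),
                y = c(1, 2, 3, 4, 5))
d$pos <- seq_len(nrow(d))

# Sum y per group, and find the first position of each group
sums  <- aggregate(y ~ x, data = d, FUN = sum)
first <- aggregate(pos ~ x, data = d, FUN = min)

# Both results list the groups in the same sorted order, so they align
# row by row; reorder the sums by first position
out <- sums[order(first$pos), ]
out
```

Here the mean position would be about 3.33 for B and 2.5 for A, putting A first even though B appears first; the minimum gives 1 for B and 2 for A.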
