I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in the data frame). I've managed to get it right
While looking for solutions to the same problem, I found another one that uses aggregate() after first converting the selection variables to factors whose levels are in the order you want.
all.add <- names(orig.df)[!(names(orig.df) %in% c("sel.1", "sel.2"))]
# Selection variables as factors with levels in the order you want
orig.df$sel.1 <- factor(orig.df$sel.1, levels = unique(orig.df$sel.1))
orig.df$sel.2 <- factor(orig.df$sel.2, levels = unique(orig.df$sel.2))
# aggregate() sorts by factor level, with the first `by` variable varying fastest
aggr.df.ordered <- aggregate(orig.df[, all.add],
                             by = list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2),
                             FUN = sum)
The output is:
   sel.1 sel.2 add.1 add.2
1      1     1   100    91
2      1     4   170   183
3      1     5   384   366
4      2     2   175   176
5      2     3    90    96
6      2     4    82    87
7      2     5    95    89
8      3     2   189   178
9      3     3    81    82
10     4     1   174   192
11     5     3    91    98
12     5     4    96    84
13     5     5    83    88
To order by the first appearance of each combination of both variables, you need a new variable:
# ordered by first appearance of the two variables (needs a new variable)
orig.df$newvar <- paste(orig.df$sel.1, orig.df$sel.2)
orig.df$newvar <- factor(orig.df$newvar, levels = unique(orig.df$newvar))
aggr.df.ordered2 <- aggregate(orig.df[, all.add],
                              by = list(newvar = orig.df$newvar,
                                        sel.1 = orig.df$sel.1,
                                        sel.2 = orig.df$sel.2),
                              FUN = sum)
which gives the output:
   newvar sel.2 sel.1 add.1 add.2
1     5 4     4     5    96    84
2     5 5     5     5    83    88
3     5 3     3     5    91    98
4     2 4     4     2    82    87
5     2 2     2     2   175   176
6     2 5     5     2    95    89
7     2 3     3     2    90    96
8     1 4     4     1   170   183
9     1 5     5     1   384   366
10    1 1     1     1   100    91
11    4 1     1     4   174   192
12    3 2     2     3   189   178
13    3 3     3     3    81    82
With this solution you do not need to install any new package.
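One caveat worth checking: aggregate() sorts its result by the grouping variables (the last one in `by` varies slowest), so when the individual selection variables are passed alongside the combined key, the rows may follow the level order of those variables rather than the first appearance of each pair. Reordering by the combined key afterwards is one way to guarantee pair order. A minimal sketch in base R, assuming a small made-up data frame (`toy.df` and `key` are illustrative names, not the data from the question):

```r
# Hypothetical toy data; column names mirror the question's layout
toy.df <- data.frame(sel.1 = c(5, 2, 5, 1, 2),
                     sel.2 = c(4, 2, 4, 5, 4),
                     add.1 = c(10, 20, 30, 40, 50))

# Combined key whose levels follow the first appearance of each pair
key <- paste(toy.df$sel.1, toy.df$sel.2)
toy.df$key <- factor(key, levels = unique(key))

res <- aggregate(toy.df["add.1"],
                 by = list(key = toy.df$key,
                           sel.1 = toy.df$sel.1,
                           sel.2 = toy.df$sel.2),
                 FUN = sum)

# order() on a factor sorts by level position, i.e. by first appearance
res <- res[order(res$key), ]
rownames(res) <- NULL
res
```

The explicit order() step is what fixes the row order here; the extra grouping columns are only kept so that sel.1 and sel.2 remain available in the result.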
It's short and simple in data.table. It returns the groups in first appearance order by default.
library(data.table)
DT = as.data.table(orig.df)
DT[, list(sum(add.1),sum(add.2)), by=list(sel.1,sel.2)]
    sel.1 sel.2  V1  V2
 1:     5     4  96  84
 2:     2     2 175 176
 3:     1     5 384 366
 4:     2     5  95  89
 5:     4     1 174 192
 6:     2     4  82  87
 7:     5     3  91  98
 8:     3     2 189 178
 9:     1     4 170 183
10:     1     1 100  91
11:     3     3  81  82
12:     5     5  83  88
13:     2     3  90  96
This is also fast on large data, so there is no need to change your code later if you run into speed issues. The following alternative syntax is the easiest way to pass in the columns to group by.
DT[, lapply(.SD,sum), by=c("sel.1","sel.2")]
    sel.1 sel.2 add.1 add.2
 1:     5     4    96    84
 2:     2     2   175   176
 3:     1     5   384   366
 4:     2     5    95    89
 5:     4     1   174   192
 6:     2     4    82    87
 7:     5     3    91    98
 8:     3     2   189   178
 9:     1     4   170   183
10:     1     1   100    91
11:     3     3    81    82
12:     5     5    83    88
13:     2     3    90    96
Alternatively, by may also be a single comma-separated string of column names:
DT[, lapply(.SD,sum), by="sel.1,sel.2"]
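If the table holds columns that should not be summed, data.table's .SDcols argument limits what .SD contains. A minimal sketch, assuming the data.table package is installed and using a made-up table (`toy` and its `note` column are hypothetical):

```r
library(data.table)

# Hypothetical toy table; `note` is a column we do not want to sum
toy <- data.table(sel.1 = c(5, 2, 5, 1),
                  sel.2 = c(4, 2, 4, 5),
                  add.1 = c(10, 20, 30, 40),
                  note  = c("a", "b", "c", "d"))

# Sum only add.1 within each (sel.1, sel.2) group;
# groups are returned in order of first appearance
res <- toy[, lapply(.SD, sum), by = c("sel.1", "sel.2"), .SDcols = "add.1"]
res
```

Without .SDcols, lapply(.SD, sum) would also try to sum the character column and fail, so naming the columns explicitly is the safer habit on wider tables.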
This is a bit tough to read, but it gives you what you want, and I added some comments to clarify.
# Define the columns you want to combine into the grouping variable
sel.col <- grepl("^sel", names(orig.df))
# Create the grouping variable
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
# Split and sum up
data.frame(unique(orig.df[sel.col]),
           t(sapply(split(orig.df[!sel.col], factor(lev, levels = unique(lev))),
                    apply, 2, sum)))
The output looks like this:
   sel.1 sel.2 add.1 add.2
1      5     4    96    84
2      2     2   175   176
3      1     5   384   366
5      2     5    95    89
6      4     1   174   192
7      2     4    82    87
8      5     3    91    98
10     3     2   189   178
11     1     4   170   183
14     1     1   100    91
17     3     3    81    82
19     5     5    83    88
20     2     3    90    96
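The key step in the code above is that split() follows the level order of the factor it is given. A small sketch with made-up vectors (`g` and `v` are illustrative, not from the question) shows the difference between the default sorted levels and first-appearance levels:

```r
# Hypothetical toy vectors
g <- c("b", "a", "b", "c", "a")
v <- c(1, 2, 3, 4, 5)

# Default: split() converts g to a factor with sorted levels
default.names <- names(split(v, g))

# Levels set to first appearance keep the original group order
f <- factor(g, levels = unique(g))
sums <- sapply(split(v, f), sum)

default.names
sums
```

This is why the answer builds the grouping factor with levels = unique(lev) before splitting.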
I'm not sure how this solution performs in terms of speed and memory on large datasets, but I thought it was a pretty easy way to solve this problem.
# Create dataframe
x <- c("C", "C", "A", "A", "A","B", "B")
y <- c(5, 6, 3, 2, 7, 8, 9)
df <- data.frame(x, y)
df
Original dataframe:
x y
1 C 5
2 C 6
3 A 3
4 A 2
5 A 7
6 B 8
7 B 9
Solution:
# Add a column with the original row order
df$order <- seq_len(nrow(df))
# Aggregate
# use sum for column y (the variable you want to aggregate by x)
df1 <- aggregate(y ~ x, data = df, FUN = sum)
# use mean for column 'order'
df2 <- aggregate(order ~ x, data = df, FUN = mean)
# Add the mean of the order values to the aggregated data frame
df <- df1
df$order <- df2$order
# Sort the data frame by the mean of the order values
df <- df[order(df$order), ]
df
Aggregated dataframe with same order:
  x  y order
3 C 11   1.5
1 A 12   4.0
2 B 17   6.5
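The mean of the row positions works here because each group occupies a contiguous block of rows. If a group's rows are scattered, the mean position no longer reflects first appearance, and taking the minimum position is one way around that. A small sketch with made-up data (`df.mix` and `pos` are illustrative names):

```r
# Hypothetical data where group "C" reappears at the end
df.mix <- data.frame(x = c("C", "A", "A", "B", "C"),
                     y = c(5, 3, 2, 8, 6))
df.mix$pos <- seq_len(nrow(df.mix))

sums  <- aggregate(y ~ x, data = df.mix, FUN = sum)
first <- aggregate(pos ~ x, data = df.mix, FUN = min)

# Both results are sorted by x, so their rows line up
sums$pos <- first$pos
sums <- sums[order(sums$pos), ]
sums
```

With FUN = mean the positions for this data would be 3 for C and 2.5 for A, which would put A ahead of C even though C appears first.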