In R (which I am relatively new to), I have a data frame that consists of many columns, including a numeric column that I need to aggregate according to groups determined by another column.
Here's my solution using aggregate.
First, load the data:
df <- read.table(text =
"SessionID Price
'1' '624.99'
'1' '697.99'
'1' '649.00'
'7' '779.00'
'7' '710.00'
'7' '2679.50'", header = TRUE)
Then aggregate and match it back to the original data.frame:
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
df <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
print(df)
# SessionID Price Min Max
#1 1 624.99 624.99 697.99
#2 1 697.99 624.99 697.99
#3 1 649.00 624.99 697.99
#4 7 779.00 710.00 2679.50
#5 7 710.00 710.00 2679.50
#6 7 2679.50 710.00 2679.50
EDIT: As per the comment below, you might wonder why this works. It is indeed somewhat weird. But remember that a data.frame is just a fancy list. Try calling str(tmp), and you'll see that the Price column itself is a 2 by 2 numeric matrix. It gets confusing because print.data.frame knows how to handle this, so print(tmp) looks like there are 3 columns. Anyway, tmp[2] simply accesses the second column/entry of the data.frame/list and returns it as a 1-column data.frame, while tmp[,2] accesses the second column and returns the data type stored.
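To make the point about the matrix column concrete, here is a small sketch that rebuilds the example data with data.frame() (an assumption made only to keep the snippet self-contained) and inspects tmp with base R functions:

```r
# Rebuild the small example from above.
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))

str(tmp)            # Price is listed as a numeric matrix with columns Min and Max
ncol(tmp)           # 2, even though print(tmp) shows three columns
class(tmp[2])       # "data.frame": list-style indexing keeps the data.frame wrapper
is.matrix(tmp[, 2]) # TRUE: matrix-style indexing returns the stored matrix
colnames(tmp[, 2])  # "Min" "Max"
```

Because tmp[, 2] is a matrix with named columns, cbind() can attach it to df as two separate columns, Min and Max.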
Using the data.table package:
library(data.table)
dt = data.table(SessionID=c(1,1,1,7,7,7), Price=c(624,697,649,779,710,2679))
dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
dt
# SessionID Price Min Max
#1: 1 624 624 697
#2: 1 697 624 697
#3: 1 649 624 697
#4: 7 779 710 2679
#5: 7 710 710 2679
#6: 7 2679 710 2679
In your case, if you have a data.frame df, just do dt=as.data.table(df) and use the code above.
I was curious how the solutions compare in a benchmark on a moderately sized data.frame (100,000 rows in 1,000 groups):
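library(dplyr)       # needed for algo1 (%>%, group_by, mutate)
library(data.table)  # needed for algo2 (as.data.table, :=)
df = data.frame(SessionID=rep(1:1000, each=100), Price=runif(100000, 1, 2000))
dt = as.data.table(df)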
algo1 <- function()
{
df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
}
algo2 <- function()
{
dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
}
algo3 <- function()
{
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
}
algo4 <- function()
{
transform(df, Min = ave(Price, SessionID, FUN = min), Max = ave(Price, SessionID, FUN = max))
}
#> system.time(algo1())
# user system elapsed
# 0.03 0.00 0.19
#> system.time(algo2())
# user system elapsed
# 0.01 0.00 0.01
#> system.time(algo3())
# user system elapsed
# 0.77 0.01 0.78
#> system.time(algo4())
# user system elapsed
# 0.02 0.01 0.03
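Timings are only comparable if the approaches return the same values. A minimal base R sketch of such a check for the aggregate/match and ave variants follows; the seed and the names res3 and res4 are assumptions introduced here for illustration:

```r
set.seed(1)  # arbitrary seed, only so the check is reproducible
df <- data.frame(SessionID = rep(1:1000, each = 100), Price = runif(100000, 1, 2000))

# aggregate + match, as in algo3
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
res3 <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])

# ave, as in algo4
res4 <- transform(df, Min = ave(Price, SessionID, FUN = min),
                      Max = ave(Price, SessionID, FUN = max))

# Both comparisons should come back TRUE if the two approaches agree.
all.equal(res3$Min, res4$Min)
all.equal(res3$Max, res4$Max)
```

Note that a single system.time() call can be noisy; repeating each call several times would give a steadier picture.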
Using base R:
df <- transform(df, Min = ave(Price, SessionID, FUN = min),
Max = ave(Price, SessionID, FUN = max))
df
# SessionID Price Min Max
#1 1 624.99 624.99 697.99
#2 1 697.99 624.99 697.99
#3 1 649.00 624.99 697.99
#4 7 779.00 710.00 2679.50
#5 7 710.00 710.00 2679.50
#6 7 2679.50 710.00 2679.50
Since your desired result is not aggregated but just the original data with two extra columns, you want to use ave in base R instead of aggregate, which you would typically use if you wanted to aggregate the data by SessionID. (NB: AEBilgrau shows that you could also do it with aggregate plus some additional matching.)
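The difference between the two functions can be seen in a small sketch on the example data (rebuilt here with data.frame() so the snippet runs on its own; per_group and per_row are names made up for illustration):

```r
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

# aggregate() collapses the data: one row per SessionID.
per_group <- aggregate(Price ~ SessionID, df, min)
nrow(per_group)   # 2

# ave() keeps the original length: one value per input row,
# with the group minimum repeated for every member of the group.
per_row <- ave(df$Price, df$SessionID, FUN = min)
length(per_row)   # 6
```

Note that FUN has to be passed by name in ave(), since every unnamed argument after the first is treated as a grouping variable.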
Similarly, for dplyr, you want to use mutate instead of summarise because you don't want to aggregate/summarise the data.
Using dplyr:
library(dplyr)
df <- df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
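For contrast, here is a short sketch of what summarise would return on the same data next to mutate; the names collapsed and kept are made up for illustration, and the trailing ungroup() is an optional addition that drops the grouping once the columns are computed:

```r
library(dplyr)

df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

# summarise() returns one row per group.
collapsed <- df %>%
  group_by(SessionID) %>%
  summarise(Min = min(Price), Max = max(Price))
nrow(collapsed)  # 2

# mutate() keeps every row and repeats the group value on each of them.
kept <- df %>%
  group_by(SessionID) %>%
  mutate(Min = min(Price), Max = max(Price)) %>%
  ungroup()
nrow(kept)       # 6
```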