R group by aggregate

In R (which I am relatively new to) I have a data frame that consists of many columns, including a numeric column that I need to aggregate according to groups determined by another column.

3 Answers

长情又很酷

2021-01-22 11:15
Here's my solution using aggregate.

First, load the data:
```
df <- read.table(text = 
"SessionID   Price
'1'       '624.99'
'1'       '697.99'
'1'       '649.00'
'7'       '779.00'
'7'       '710.00'
'7'       '2679.50'", header = TRUE) 
```
Then aggregate and match it back to the original data.frame:
```
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
df <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
print(df)
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50
```
EDIT: As per the comment below, you might wonder why this works. It is indeed somewhat odd, but remember that a data.frame is just a fancy list. Call str(tmp) and you'll see that tmp has only two columns and that the Price column is itself a 2-by-2 numeric matrix. This is confusing because print.data.frame knows how to display a matrix column, so print(tmp) looks as if there were three columns. tmp[2] accesses the second entry of the data.frame as a list and returns a one-column data.frame, while tmp[, 2] accesses the second column and returns the object stored in it, here the matrix. That is why cbind adds two columns, Min and Max.
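The matrix-column behaviour described in the EDIT can be checked directly. The following is a minimal sketch in base R that rebuilds the same six rows with data.frame() instead of read.table(); the name flat is only an illustration and is not part of the original answer.

```r
# Rebuild the small example and inspect what aggregate() returns.
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))
tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))

str(tmp)             # Price is a numeric matrix with columns "Min" and "Max"
ncol(tmp)            # 2, although print(tmp) appears to show 3 columns
class(tmp[2])        # "data.frame": list-style indexing keeps the data.frame wrapper
is.matrix(tmp[, 2])  # TRUE: matrix-style indexing returns the stored matrix
dim(tmp[, 2])        # 2 2: one row per SessionID, one column each for Min and Max

# One way to flatten the matrix column into ordinary columns (illustrative name).
flat <- do.call(data.frame, tmp)
names(flat)          # "SessionID" "Price.Min" "Price.Max"
```

Flattening with do.call(data.frame, tmp) is optional here; it mainly helps if the aggregated table is going to be used on its own rather than bound back to df.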

时光取名叫无心

2021-01-22 11:18

Using the data.table package:
```
library(data.table)

dt = data.table(SessionID=c(1,1,1,7,7,7), Price=c(624,697,649,779,710,2679))

# := adds Min and Max by reference; each group's value is repeated on every row
dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
dt
#   SessionID Price Min  Max
#1:         1   624 624  697
#2:         1   697 624  697
#3:         1   649 624  697
#4:         7   779 710 2679
#5:         7   710 710 2679
#6:         7  2679 710 2679
```

If you already have a data.frame df, convert it with dt = as.data.table(df) and use the code above.
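As a small sketch of that conversion, assuming the data.table package is installed: as.data.table() returns a copy, so the columns added with := appear in dt but not in the original df.

```r
library(data.table)

df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624, 697, 649, 779, 710, 2679))

dt <- as.data.table(df)  # a copy; df itself stays a plain data.frame
dt[, c("Min", "Max") := list(min(Price), max(Price)), by = SessionID]

names(df)  # "SessionID" "Price"
names(dt)  # "SessionID" "Price" "Min" "Max"
```

If keeping the original data.frame untouched does not matter, setDT(df) converts it in place without making a copy.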

I was curious how the solutions compare on a moderately sized data.frame (100,000 rows in 1,000 groups):

```
library(dplyr)  # needed for %>%, group_by and mutate in algo1

df = data.frame(SessionID=rep(1:1000, each=100), Price=runif(100000, 1, 2000))
dt = as.data.table(df)

# dplyr
algo1 <- function()
{
    df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
}

# data.table (note: := modifies dt in place)
algo2 <- function()
{
    dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
}

# aggregate + match
algo3 <- function()
{
    tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
    cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
}

# transform + ave
algo4 <- function()
{
    transform(df, Min = ave(Price, SessionID, FUN = min), Max = ave(Price, SessionID, FUN = max))
}

#> system.time(algo1())
#   user  system elapsed 
#   0.03    0.00    0.19 

#> system.time(algo2())
#   user  system elapsed 
#   0.01    0.00    0.01 

#> system.time(algo3())
#   user  system elapsed 
#   0.77    0.01    0.78 

#> system.time(algo4())
#   user  system elapsed 
#   0.02    0.01    0.03
```
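A single system.time() call can be noisy, so the timings above are best read as rough orders of magnitude. The sketch below, restricted to the two base R variants so that it needs no packages, first checks that both produce the same Min and Max and then takes the median elapsed time over a few runs. The helper time_median and the seed are assumptions added for illustration; they are not part of the original benchmark.

```r
# Hypothetical helper: median elapsed time of f() over several runs.
time_median <- function(f, times = 3) {
  elapsed <- replicate(times, system.time(f())[["elapsed"]])
  median(elapsed)
}

set.seed(1)  # arbitrary seed so the random prices are reproducible
df <- data.frame(SessionID = rep(1:1000, each = 100),
                 Price = runif(100000, 1, 2000))

# aggregate + match, as in algo3 above
algo3 <- function() {
  tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
  cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
}

# transform + ave, as in algo4 above
algo4 <- function() {
  transform(df, Min = ave(Price, SessionID, FUN = min),
                Max = ave(Price, SessionID, FUN = max))
}

# Both approaches should agree before their speed is compared.
res3 <- algo3()
res4 <- algo4()
all.equal(res3$Min, res4$Min)
all.equal(res3$Max, res4$Max)

timings <- c(aggregate = time_median(algo3), ave = time_median(algo4))
timings
```

Absolute numbers will differ between machines and R versions; only the relative ordering is meaningful.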


小鲜肉

2021-01-22 11:25
Using base R:
```
df <- transform(df, Min = ave(Price, SessionID, FUN = min),
                    Max = ave(Price, SessionID, FUN = max))
df
#  SessionID   Price    Min     Max
#1         1  624.99 624.99  697.99
#2         1  697.99 624.99  697.99
#3         1  649.00 624.99  697.99
#4         7  779.00 710.00 2679.50
#5         7  710.00 710.00 2679.50
#6         7 2679.50 710.00 2679.50
```
Since your desired result is not aggregated but just the original data with two extra columns, you want to use ave in base R instead of aggregate, which you would typically use if you wanted to aggregate the data by SessionID. (NB: AEBilgrau shows that you could also do it with aggregate with some additional matching.)
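To make the difference concrete, here is a minimal base R sketch on the same six rows (the names agg and per_row are illustrative): aggregate collapses the data to one row per SessionID, while ave returns one value for every original row.

```r
df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

agg     <- aggregate(Price ~ SessionID, df, min)   # one row per SessionID
per_row <- ave(df$Price, df$SessionID, FUN = min)  # one value per original row

nrow(agg)        # 2
length(per_row)  # 6
per_row          # 624.99 624.99 624.99 710.00 710.00 710.00
```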

Similarly, for dplyr, you want to use mutate instead of summarise because you don't want to aggregate/summarise the data.

Using dplyr:
```
library(dplyr)
df <- df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
```
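For comparison, a short sketch of the two dplyr verbs side by side, assuming dplyr is installed (the names collapsed and kept are illustrative): summarise reduces each group to one row, whereas mutate keeps all rows. The trailing ungroup() is optional and only drops the grouping so that later operations are not applied per SessionID by accident.

```r
library(dplyr)

df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                 Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

# summarise: one row per group
collapsed <- df %>%
  group_by(SessionID) %>%
  summarise(Min = min(Price), Max = max(Price))

# mutate: original rows kept, group values repeated on each row
kept <- df %>%
  group_by(SessionID) %>%
  mutate(Min = min(Price), Max = max(Price)) %>%
  ungroup()

nrow(collapsed)  # 2
nrow(kept)       # 6
```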