R group by aggregate

南笙 2021-01-22 10:37

In R (which I am relatively new to) I have a data frame consisting of many columns, plus a numeric column that I need to aggregate according to groups determined by another column.

3 Answers
  • 2021-01-22 11:15

    Here's my solution using aggregate.

    First, load the data:

    df <- read.table(text = 
    "SessionID   Price
    '1'       '624.99'
    '1'       '697.99'
    '1'       '649.00'
    '7'       '779.00'
    '7'       '710.00'
    '7'       '2679.50'", header = TRUE) 
    

    Then aggregate and match it back to the original data.frame:

    tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
    df <- cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
    print(df)
    #  SessionID   Price    Min     Max
    #1         1  624.99 624.99  697.99
    #2         1  697.99 624.99  697.99
    #3         1  649.00 624.99  697.99
    #4         7  779.00 710.00 2679.50
    #5         7  710.00 710.00 2679.50
    #6         7 2679.50 710.00 2679.50
    

    EDIT: As per the comment below, you might wonder why this works. It is indeed somewhat odd, but remember that a data.frame is just a fancy list. Call str(tmp) and you'll see that the Price column is itself a 2-by-2 numeric matrix. This is confusing because print.data.frame knows how to handle matrix columns, so print(tmp) looks as if there are 3 columns. In any case, tmp[2] accesses the second column/entry of the data.frame/list and returns it as a 1-column data.frame, while tmp[, 2] accesses the second column and returns the object stored there (here, the matrix).
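
    The point about the matrix column can be checked directly. The following is a minimal sketch that rebuilds the same six rows with data.frame() instead of read.table(); the name flat and the do.call(data.frame, ...) flattening step are additions for illustration and not part of the answer's code.

    ```r
    df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                     Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))
    tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))

    str(tmp)               # two columns; Price is a numeric matrix with columns Min and Max
    ncol(tmp)              # 2, even though print(tmp) shows three headers
    is.matrix(tmp[, 2])    # TRUE: `[, 2]` extracts the stored matrix itself
    is.data.frame(tmp[2])  # TRUE: `[2]` keeps the data.frame wrapper

    # Optional: flatten the matrix column into ordinary columns
    flat <- do.call(data.frame, tmp)
    names(flat)            # "SessionID" "Price.Min" "Price.Max"
    ```

    Flattening is only needed if you want to work with tmp on its own; the cbind() call above already spreads the matrix into separate Min and Max columns.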

  • 2021-01-22 11:18

    Using the data.table package:

    library(data.table)
    
    dt = data.table(SessionID=c(1,1,1,7,7,7), Price=c(624,697,649,779,710,2679))
    
    dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
    dt
    #   SessionID Price Min  Max
    #1:         1   624 624  697
    #2:         1   697 624  697
    #3:         1   649 624  697
    #4:         7   779 710 2679
    #5:         7   710 710 2679
    #6:         7  2679 710 2679
    

    In your case if you have a data.frame df, just do dt=as.data.table(df) and use the code above.
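
    As a minimal sketch of that conversion, using the Price values from the question rather than the rounded ones above: as.data.table() makes a copy, so the := assignment adds the columns to dt only and leaves df unchanged. If you would rather convert df in place, data.table also provides setDT().

    ```r
    library(data.table)

    df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                     Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

    dt <- as.data.table(df)   # makes a copy; df itself is left untouched
    dt[, c("Min", "Max") := list(min(Price), max(Price)), by = SessionID]

    names(df)   # still "SessionID" "Price"
    names(dt)   # "SessionID" "Price" "Min" "Max"
    ```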

    I was curious how the solutions compare on a moderately sized data.frame (100,000 rows in 1,000 groups):

    library(dplyr)  # provides %>%, group_by and mutate used in algo1
    
    df = data.frame(SessionID=rep(1:1000, each=100), Price=runif(100000, 1, 2000))
    dt = as.data.table(df)
    
    algo1 <- function() 
    {
        df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
    }
    
    algo2 <- function()
    {
        dt[, c("Min", "Max"):=list(min(Price),max(Price)), by=SessionID]
    }
    
    algo3 <- function()
    {
        tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
        cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
    }
    
    algo4 <- function()
    {
        transform(df, Min = ave(Price, SessionID, FUN = min), Max = ave(Price, SessionID, FUN = max))
    }   
    
    
    
    #> system.time(algo1())
    #   user  system elapsed 
    #   0.03    0.00    0.19 
    
    #> system.time(algo2())
    #   user  system elapsed 
    #   0.01    0.00    0.01 
    
    #> system.time(algo3())
    #   user  system elapsed 
    #   0.77    0.01    0.78 
    
    #> system.time(algo4())
    #   user  system elapsed 
    #   0.02    0.01    0.03 
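
    A single system.time() call can be noisy, and algo2 modifies dt by reference, so repeated calls overwrite the columns added by the first one. Below is a rough sketch, using base R only and a smaller data set, that first checks that two of the approaches agree and then takes the median elapsed time over a few runs. The names via_ave, via_aggregate and time_median are hypothetical helpers introduced here, and the seed and sizes are arbitrary.

    ```r
    set.seed(1)  # arbitrary seed, only so the sketch is reproducible
    df <- data.frame(SessionID = rep(1:100, each = 50),
                     Price = runif(5000, 1, 2000))

    via_ave <- function() {
      transform(df, Min = ave(Price, SessionID, FUN = min),
                    Max = ave(Price, SessionID, FUN = max))
    }
    via_aggregate <- function() {
      tmp <- aggregate(Price ~ SessionID, df, function(x) c(Min = min(x), Max = max(x)))
      cbind(df, tmp[match(df$SessionID, tmp$SessionID), 2])
    }

    # Both approaches should produce identical Min and Max columns
    a <- via_ave()
    b <- via_aggregate()
    stopifnot(all(a$Min == b$Min), all(a$Max == b$Max))

    # Median elapsed time over several runs (time_median is a hypothetical helper)
    time_median <- function(f, runs = 5) {
      median(replicate(runs, system.time(f())[["elapsed"]]))
    }
    c(ave = time_median(via_ave), aggregate = time_median(via_aggregate))
    ```

    The absolute numbers depend on the machine and R version, so treat them as a rough ranking only.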
    
  • 2021-01-22 11:25

    Using base R:

    df <- transform(df, Min = ave(Price, SessionID, FUN = min),
                        Max = ave(Price, SessionID, FUN = max))
    df
    #  SessionID   Price    Min     Max
    #1         1  624.99 624.99  697.99
    #2         1  697.99 624.99  697.99
    #3         1  649.00 624.99  697.99
    #4         7  779.00 710.00 2679.50
    #5         7  710.00 710.00 2679.50
    #6         7 2679.50 710.00 2679.50
    

    Since your desired result is not aggregated but just the original data with two extra columns, you want to use ave in base R instead of aggregate, which you would typically use if you wanted to aggregate the data by SessionID. (NB: AEBilgrau shows that you could also do it with aggregate with some additional matching.)
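
    The difference in shape is easy to see on the example data. This is a small sketch (the names agg and per_row are only for illustration): aggregate returns one row per SessionID, while ave returns one value per original row.

    ```r
    df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                     Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

    # aggregate collapses the data to one row per group
    agg <- aggregate(Price ~ SessionID, df, min)
    nrow(agg)          # 2

    # ave returns one value per original row, repeated within each group
    per_row <- ave(df$Price, df$SessionID, FUN = min)
    length(per_row)    # 6
    ```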

    Similarly, for dplyr, you want to use mutate instead of summarise because you don't want to aggregate/summarise the data.

    Using dplyr:

    library(dplyr)
    df <- df %>% group_by(SessionID) %>% mutate(Min = min(Price), Max = max(Price))
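
    For contrast, here is a short sketch of what summarise would return on the same data (the names collapsed and expanded are only for illustration). Note that the result of mutate is still a grouped tibble; the ungroup() call is an optional addition, not part of the line above.

    ```r
    library(dplyr)

    df <- data.frame(SessionID = c(1, 1, 1, 7, 7, 7),
                     Price = c(624.99, 697.99, 649.00, 779.00, 710.00, 2679.50))

    # summarise collapses each group to a single row
    collapsed <- df %>%
      group_by(SessionID) %>%
      summarise(Min = min(Price), Max = max(Price))
    nrow(collapsed)   # 2

    # mutate keeps every row and adds the group-level values alongside
    expanded <- df %>%
      group_by(SessionID) %>%
      mutate(Min = min(Price), Max = max(Price)) %>%
      ungroup()
    nrow(expanded)    # 6
    ```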
    