How does one aggregate and summarize data quickly?

前端 未结 2 930
别跟我提以往
别跟我提以往 2020-12-29 08:05

I have a dataset whose headers look like so:

PID Time Site Rep Count

I want sum the Count by Rep for each P

相关标签:
2条回答
  • 2020-12-29 08:57

    You should look at the package data.table for faster aggregation operations on large data frames. For your problem, the solution would look like:

    library(data.table)
    data_t = data.table(data_tab)
    ans = data_t[,list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']
    
    0 讨论(0)
  • 2020-12-29 09:00

    Let's see how fast data.table is and compare to using dplyr. Thishis would be roughly the way to do it in dplyr.

    data %>% group_by(PID, Time, Site, Rep) %>%
        summarise(totalCount = sum(Count)) %>%
        group_by(PID, Time, Site) %>% 
        summarise(mean(totalCount))
    

    Or perhaps this, depending on exactly how the question is interpreted:

        data %>% group_by(PID, Time, Site) %>%
            summarise(totalCount = sum(Count), meanCount = mean(Count)  
    

    Here is a full example of these alternatives versus @Ramnath proposed answer and the one @David Arenburg proposed in the comments , which I think is equivalent to the second dplyr statement.

    nrow <- 510000
    data <- data.frame(PID = sample(letters, nrow, replace = TRUE), 
                       Time = sample(letters, nrow, replace = TRUE),
                       Site = sample(letters, nrow, replace = TRUE),
                       Rep = rnorm(nrow),
                       Count = rpois(nrow, 100))
    
    
    library(dplyr)
    library(data.table)
    
    Rprof(tf1 <- tempfile())
    ans <- data %>% group_by(PID, Time, Site, Rep) %>%
        summarise(totalCount = sum(Count)) %>%
        group_by(PID, Time, Site) %>% 
        summarise(mean(totalCount))
    Rprof()
    summaryRprof(tf1)  #reports 1.68 sec sampling time
    
    Rprof(tf2 <- tempfile())
    ans <- data %>% group_by(PID, Time, Site, Rep) %>%
        summarise(total = sum(Count), meanCount = mean(Count)) 
    Rprof()
    summaryRprof(tf2)  # reports 1.60 seconds
    
    Rprof(tf3 <- tempfile())
    data_t = data.table(data)
    ans = data_t[,list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
    Rprof()
    summaryRprof(tf3)  #reports 0.06 seconds
    
    Rprof(tf4 <- tempfile())
    ans <- setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
    Rprof()
    summaryRprof(tf4)  #reports 0.02 seconds
    

    The data table method is much faster, and the setDT is even faster!

    0 讨论(0)
提交回复
热议问题