How does one aggregate and summarize data quickly?

前端未结

关注

 2  934

I have a dataset whose headers look like so:

PID Time Site Rep Count

I want sum the Count by Rep for each P


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  南方客        
                
              
                            
                2020-12-29 08:57
              
            
            
                                                                       
You should look at the package data.table for faster aggregation operations on large data frames. For your problem, the solution would look like:

library(data.table)
data_t = data.table(data_tab)
ans = data_t[,list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦谈多话        
                
              
                            
                2020-12-29 09:00
              
            
            
                                                                       
Let's see how fast data.table is and compare to using dplyr. Thishis would be roughly the way to do it in dplyr. 

data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))


Or perhaps this, depending on exactly how the question is interpreted:

    data %>% group_by(PID, Time, Site) %>%
        summarise(totalCount = sum(Count), meanCount = mean(Count)  


Here is a full example of these alternatives versus @Ramnath proposed answer and the one @David Arenburg proposed in the comments , which I think is equivalent to the second dplyr statement.

nrow <- 510000
data <- data.frame(PID = sample(letters, nrow, replace = TRUE), 
                   Time = sample(letters, nrow, replace = TRUE),
                   Site = sample(letters, nrow, replace = TRUE),
                   Rep = rnorm(nrow),
                   Count = rpois(nrow, 100))


library(dplyr)
library(data.table)

Rprof(tf1 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))
Rprof()
summaryRprof(tf1)  #reports 1.68 sec sampling time

Rprof(tf2 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(total = sum(Count), meanCount = mean(Count)) 
Rprof()
summaryRprof(tf2)  # reports 1.60 seconds

Rprof(tf3 <- tempfile())
data_t = data.table(data)
ans = data_t[,list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf3)  #reports 0.06 seconds

Rprof(tf4 <- tempfile())
ans <- setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf4)  #reports 0.02 seconds


The data table method is much faster, and the setDT is even faster!
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复