Fastest way to reshape variable values as columns

春和景丽 2021-02-03 11:17

I have a dataset with about 3 million rows and the following structure:

PatientID| Year | PrimaryConditionGroup
---------------------------------------
1


      
      
        
2 Answers

星月不相逢 (OP) 2021-02-03 11:36

data.table versions >= 1.9.0 include fast melt and dcast methods for data.tables, implemented in C. Here's a comparison with the other excellent answers from @Josh's post on 3-million-row data (excluding base R's aggregate, which was taking quite some time).

For more information, see the NEWS entry.

I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.

require(data.table) ## >= 1.9.0
require(reshape2)

set.seed(1L)
patients = 1000L
year = 5L
n = 3e6L
condn = c("TRAUMA", "PREGNANCY", "SEIZURE")

# dummy data
DT <- data.table(PatientID = sample(patients, n, TRUE),
                 Year = sample(year, n, TRUE), 
                 PrimaryConditionGroup = sample(condn, n, TRUE))

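# data.table's C-level cast: count rows per PatientID, one column per Year
DT_dcast <- function(DT) {
    dcast.data.table(DT, PatientID ~ Year, fun.aggregate=length)
}

# same cast through reshape2 (dcast is exported, so :: is enough)
reshape2_dcast <- function(DT) {
    reshape2::dcast(DT, PatientID ~ Year, fun.aggregate=length)
}

# plain grouped aggregation: count each condition per PatientID and Year
DT_raw <- function(DT) {
    DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
            PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
              SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
}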

# system.time(.) timed 3 times
#          Method Time_rep1 Time_rep2 Time_rep3
#        DT_dcast     0.393     0.399     0.396
#  reshape2_dcast     3.784     3.457     3.605
#          DT_raw     0.647     0.680     0.657
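
The post reports three system.time runs per method but does not show how they were collected. One way to gather such timings might look like the sketch below. The helper `time_reps`, the function `count_by_group` and the small stand-in table `small` are hypothetical names introduced for illustration; they are not from the original answer, and the table is kept tiny so the example runs quickly.

```r
library(data.table)

# Hypothetical helper: call f(x) `reps` times, return elapsed seconds per run
time_reps <- function(f, x, reps = 3L) {
  vapply(seq_len(reps),
         function(i) system.time(f(x))[["elapsed"]],
         numeric(1))
}

# Small stand-in for DT (assumption: same three columns, far fewer rows)
set.seed(1L)
small <- data.table(
  PatientID = sample(100L, 1e4, TRUE),
  Year = sample(5L, 1e4, TRUE),
  PrimaryConditionGroup = sample(c("TRAUMA", "PREGNANCY", "SEIZURE"), 1e4, TRUE)
)

# Any function of one data.table argument can be timed this way
count_by_group <- function(d) d[, .N, by = list(PatientID, Year)]

timings <- time_reps(count_by_group, small)
print(timings)
```

Passing DT_dcast, reshape2_dcast or DT_raw together with the full DT in place of `count_by_group` and `small` would give one row of a table like the one above.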


dcast.data.table is about 1.6x faster than plain grouped aggregation with data.table and about 8.8x faster than reshape2::dcast.
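
Note that the cast formula used in the benchmark, PatientID ~ Year, spreads Year across the columns, whereas DT_raw returns one count column per condition. If the goal is one column per PrimaryConditionGroup value, a cast along the following lines may be closer to what DT_raw computes. This is only a sketch on a made-up table (`toy` is hypothetical), the explicit `value.var` is an assumption added to avoid guessing, and it was not part of the timings above.

```r
library(data.table)

# Hypothetical toy data with the same three columns as DT
toy <- data.table(
  PatientID = c(1L, 1L, 1L, 2L, 2L),
  Year = c(1L, 1L, 2L, 1L, 1L),
  PrimaryConditionGroup = c("TRAUMA", "TRAUMA", "SEIZURE", "PREGNANCY", "TRAUMA")
)

# One row per (PatientID, Year), one count column per condition value
wide <- dcast.data.table(toy, PatientID + Year ~ PrimaryConditionGroup,
                         fun.aggregate = length,
                         value.var = "PrimaryConditionGroup")
print(wide)
```

Conditions that never occur for a given patient and year come out as 0, because length is applied to an empty group.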
    
             
                                                        
            
            
              
                