I have a dataset with about 3 million rows and the following structure:

PatientID | Year | PrimaryConditionGroup
----------|------|----------------------
1         | Y1   | TRAUMA
1         | Y1   | PREGNANCY
2         | Y2   | SEIZURE
3         | Y1   | TRAUMA

For each PatientID and Year, I'd like one column per PrimaryConditionGroup holding the number of times that condition occurs.
There are probably more succinct ways of doing this, but for sheer speed it's hard to beat a data.table-based solution:
df <- read.table(text="PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header=TRUE)
library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))
dt[ , list(TRAUMA    = sum(PrimaryConditionGroup == "TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
           SEIZURE   = sum(PrimaryConditionGroup == "SEIZURE")),
    by = list(PatientID, Year)]
# PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,] 1 Y1 1 1 0
# [2,] 2 Y2 0 0 1
# [3,] 3 Y1 1 0 0
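If you'd rather not hard-code each condition, a more succinct data.table variant (a sketch, not benchmarked) is to coerce the column to a factor, so that every group reports a count for all levels, and let as.list() spread table()'s result into one column per condition:
dt[ , as.list(table(factor(PrimaryConditionGroup,
                           levels = c("TRAUMA", "PREGNANCY", "SEIZURE")))),
    by = list(PatientID, Year)]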
EDIT: aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that the aggregated counts come back as a matrix column rather than as separate data.frame columns; the second line below fixes that up.)
df$PrimaryConditionGroup <- factor(df$PrimaryConditionGroup)  # needed in R >= 4.0, where read.table() no longer returns factors
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))
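For the example data this should yield the same counts, though aggregate() orders the rows by Year and puts the condition columns in alphabetical order:
#   PatientID Year PREGNANCY SEIZURE TRAUMA
# 1         1   Y1         1       0      1
# 2         3   Y1         0       0      1
# 3         2   Y2         0       1      0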
2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.
library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(mdf, PatientID + Year ~ value, fun.aggregate=length)
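The same approach carries over to reshape's successor, reshape2, where dcast() replaces cast() (a sketch, assuming the same df as above):
library(reshape2)
mdf <- melt(df, id=c("PatientID", "Year"))
dcast(mdf, PatientID + Year ~ value, fun.aggregate=length)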
data.table has fast melt and dcast methods of its own, implemented in C, as of version 1.9.0. Here's a comparison with the other excellent answers from @Josh's post on 3-million-row data (excluding base R's aggregate(), as it was taking quite some time). For more info, see the NEWS entry for data.table v1.9.0.
I'll assume you have 1,000 patients and 5 years in total; adjust the variables patients and year below accordingly.
require(data.table) ## >= 1.9.0
require(reshape2)
set.seed(1L)
patients = 1000L
year = 5L
n = 3e6L
condn = c("TRAUMA", "PREGNANCY", "SEIZURE")
# dummy data
DT <- data.table(PatientID = sample(patients, n, TRUE),
                 Year = sample(year, n, TRUE),
                 PrimaryConditionGroup = sample(condn, n, TRUE))
DT_dcast <- function(DT) {
    dcast.data.table(DT, PatientID + Year ~ PrimaryConditionGroup, fun.aggregate=length)
}
reshape2_dcast <- function(DT) {
    reshape2::dcast(DT, PatientID + Year ~ PrimaryConditionGroup, fun.aggregate=length)
}
DT_raw <- function(DT) {
    DT[ , list(TRAUMA    = sum(PrimaryConditionGroup == "TRAUMA"),
               PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
               SEIZURE   = sum(PrimaryConditionGroup == "SEIZURE")),
        by = list(PatientID, Year)]
}
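A harness along these lines (a sketch, not the original benchmark script) times each method once; the table below reports three such repetitions:
for (f in c("DT_dcast", "reshape2_dcast", "DT_raw")) {
    cat(f, ":", system.time(match.fun(f)(DT))[["elapsed"]], "sec\n")
}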
# system.time(.) timed 3 times
#         Method Time_rep1 Time_rep2 Time_rep3
#       DT_dcast     0.393     0.399     0.396
# reshape2_dcast     3.784     3.457     3.605
#         DT_raw     0.647     0.680     0.657
dcast.data.table is about 1.6x faster than plain data.table aggregation, and about 8.8x faster than reshape2::dcast.
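Applied back to the small df from the first answer, the fast dcast gives the desired wide table directly (a sketch):
dcast.data.table(as.data.table(df), PatientID + Year ~ PrimaryConditionGroup,
                 fun.aggregate=length)
#    PatientID Year PREGNANCY SEIZURE TRAUMA
# 1:         1   Y1         1       0      1
# 2:         2   Y2         0       1      0
# 3:         3   Y1         0       0      1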