Fastest way to reshape variable values as columns

春和景丽 2021-02-03 11:17

I have a dataset with about 3 million rows and the following structure:

PatientID| Year | PrimaryConditionGroup
---------------------------------------
1                 


        
2 Answers
  •  星月不相逢
    2021-02-03 11:36

    data.table (versions >= 1.9.0) provides fast melt and dcast methods implemented in C. Here is a comparison with the other excellent answers from @Josh's post on 3-million-row data (excluding only base R's aggregate, which was taking quite some time).
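
As a minimal sketch of what these two functions do, here is a toy table whose column names follow the question; the values are invented for illustration. It assumes a data.table version in which the dcast and melt generics dispatch on data.tables (1.9.6 or later, as far as I know); on older versions dcast.data.table and melt.data.table would be called directly.

```r
library(data.table)

# Toy data: values are made up for illustration only
toy <- data.table(
  PatientID = c(1L, 1L, 1L, 2L, 2L),
  Year = c(1L, 1L, 2L, 1L, 1L),
  PrimaryConditionGroup = c("TRAUMA", "TRAUMA", "SEIZURE", "PREGNANCY", "TRAUMA")
)

# Long -> wide: one column per Year, each cell counting the matching rows
wide <- dcast(toy, PatientID ~ Year,
              fun.aggregate = length, value.var = "PrimaryConditionGroup")
print(wide)

# Wide -> long: back to one row per (PatientID, Year) pair with its count
long <- melt(wide, id.vars = "PatientID",
             variable.name = "Year", value.name = "N")
print(long)
```

With fun.aggregate = length, combinations that never occur in the data (patient 2 in year 2 here) come out as 0 rather than NA.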

    For more information, see the data.table NEWS entry.

    I'll assume there are 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.

    library(data.table) ## >= 1.9.0
    library(reshape2)
    
    set.seed(1L)
    patients = 1000L   # number of distinct patients
    year = 5L          # number of distinct years
    n = 3e6L           # number of rows
    condn = c("TRAUMA", "PREGNANCY", "SEIZURE")
    
    # dummy data
    DT <- data.table(PatientID = sample(patients, n, TRUE),
                     Year = sample(year, n, TRUE), 
                     PrimaryConditionGroup = sample(condn, n, TRUE))
    
    # data.table's dcast (implemented in C)
    DT_dcast <- function(DT) {
        dcast.data.table(DT, PatientID ~ Year, fun.aggregate=length)
    }
    
    # reshape2's dcast (exported, so `::` is sufficient)
    reshape2_dcast <- function(DT) {
        reshape2::dcast(DT, PatientID ~ Year, fun.aggregate=length)
    }
    
    # plain grouped aggregation with data.table
    DT_raw <- function(DT) {
        DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
                PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
                  SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
        by = list(PatientID, Year)]
    }
    
    # system.time(.) timed 3 times (seconds)
    #         Method Time_rep1 Time_rep2 Time_rep3
    #       DT_dcast     0.393     0.399     0.396
    # reshape2_dcast     3.784     3.457     3.605
    #         DT_raw     0.647     0.680     0.657
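
The answer reports three system.time runs per method but does not show how they were collected. One plausible way is sketched below. The helper time_reps is hypothetical (it is not part of the original answer), and the data is much smaller than 3 million rows so that the sketch runs quickly. It again assumes a data.table version where the dcast generic dispatches on data.tables.

```r
library(data.table)

set.seed(1L)
n <- 1e5L  # far fewer rows than the 3 million used for the reported timings
DT <- data.table(PatientID = sample(1000L, n, TRUE),
                 Year = sample(5L, n, TRUE),
                 PrimaryConditionGroup = sample(c("TRAUMA", "PREGNANCY", "SEIZURE"), n, TRUE))

DT_dcast <- function(DT) {
  dcast(DT, PatientID ~ Year, fun.aggregate = length,
        value.var = "PrimaryConditionGroup")
}

DT_raw <- function(DT) {
  DT[, list(TRAUMA = sum(PrimaryConditionGroup == "TRAUMA"),
            PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
            SEIZURE = sum(PrimaryConditionGroup == "SEIZURE")),
     by = list(PatientID, Year)]
}

# Hypothetical helper: run f(DT) `reps` times and return elapsed seconds per run
time_reps <- function(f, DT, reps = 3L) {
  vapply(seq_len(reps),
         function(i) system.time(f(DT))[["elapsed"]],
         numeric(1))
}

# One row per method, one column per repetition
timings <- rbind(DT_dcast = time_reps(DT_dcast, DT),
                 DT_raw = time_reps(DT_raw, DT))
colnames(timings) <- paste0("Time_rep", 1:3)
print(timings)
```

Absolute numbers will differ by machine and package version, so only the ratios between methods are worth comparing.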
    

    Averaged over the three runs, dcast.data.table is about 1.7x faster than plain aggregation with data.table and about 9x faster than reshape2::dcast.
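
Note that the benchmark formula PatientID ~ Year produces one column per year, whereas DT_raw produces one column per condition. If the goal is one column per PrimaryConditionGroup value for each patient and year (an assumption, since the example table in the question is cut off), a formula along the following lines may be closer. The sketch uses small made-up data and checks that the result agrees with the grouped aggregation.

```r
library(data.table)

set.seed(1L)
n <- 1e4L
condn <- c("TRAUMA", "PREGNANCY", "SEIZURE")
DT <- data.table(PatientID = sample(50L, n, TRUE),
                 Year = sample(5L, n, TRUE),
                 PrimaryConditionGroup = sample(condn, n, TRUE))

# One column per condition, counting rows per (PatientID, Year)
by_cast <- dcast(DT, PatientID + Year ~ PrimaryConditionGroup,
                 fun.aggregate = length, value.var = "PrimaryConditionGroup")

# The same counts via grouped aggregation, sorted by the grouping columns;
# columns are listed alphabetically to match the order dcast produces
by_group <- DT[, list(PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
                      SEIZURE = sum(PrimaryConditionGroup == "SEIZURE"),
                      TRAUMA = sum(PrimaryConditionGroup == "TRAUMA")),
               keyby = list(PatientID, Year)]

# Compare as plain data frames so data.table keys and attributes do not matter
same <- isTRUE(all.equal(as.data.frame(by_cast), as.data.frame(by_group)))
print(same)
```

The reported timings were measured with the PatientID ~ Year formula, so they may not carry over exactly to this variant.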
