Fastest way to reshape variable values as columns

春和景丽 asked 2021-02-03 11:17

I have a dataset with about 3 million rows and the following structure:

PatientID | Year | PrimaryConditionGroup
----------------------------------------
1         | Y1   | TRAUMA
1         | Y1   | PREGNANCY
2         | Y2   | SEIZURE
3         | Y1   | TRAUMA

My goal is to reshape the data so that each PatientID/Year combination becomes a single row, with one column per PrimaryConditionGroup value holding the number of times it occurs:

PatientID | Year | TRAUMA | PREGNANCY | SEIZURE
------------------------------------------------
1         | Y1   | 1      | 1         | 0
2         | Y2   | 0      | 0         | 1
3         | Y1   | 1      | 0         | 0

What is the fastest way to do this?

2 Answers
  • 2021-02-03 11:26

    There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:

    df <- read.table(text="PatientID Year  PrimaryConditionGroup
    1         Y1    TRAUMA
    1         Y1    PREGNANCY
    2         Y2    SEIZURE
    3         Y1    TRAUMA", header=T)
    
    library(data.table)
    dt <- data.table(df, key=c("PatientID", "Year"))
    
    dt[ , list(TRAUMA =    sum(PrimaryConditionGroup=="TRAUMA"),
               PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
               SEIZURE =   sum(PrimaryConditionGroup=="SEIZURE")),
       by = list(PatientID, Year)]
    
    #      PatientID Year TRAUMA PREGNANCY SEIZURE
    # [1,]         1   Y1      1         1       0
    # [2,]         2   Y2      0         0       1
    # [3,]         3   Y1      1         0       0
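
    If the set of conditions were larger or not known in advance, the same idea can be written without hard-coding the column names; here is a sketch (my addition, not part of the original answer) that reuses dt from above:

    # Data-driven variant: tabulate a factor whose levels cover every condition,
    # so each PatientID/Year group yields the same set of count columns.
    conds <- sort(unique(as.character(dt$PrimaryConditionGroup)))
    dt[ , as.list(table(factor(PrimaryConditionGroup, levels=conds))),
       by = list(PatientID, Year)]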
    

    EDIT: aggregate() provides a 'base R' solution that may or may not be more idiomatic. (The sole complication is that aggregate() returns the counts as a single matrix-valued column rather than as separate data.frame columns; the second line below fixes that up.)

    out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
    out <- cbind(out[1:2], data.frame(out[3][[1]]))
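
    To see why that second line is needed, it helps to inspect the intermediate result; a small illustration (assuming PrimaryConditionGroup is a factor, as read.table produced by default when this was written):

    # Illustration only: with a factor grouping column, FUN=table packs the
    # counts into a single matrix-valued column rather than separate columns.
    df2 <- transform(df, PrimaryConditionGroup = factor(PrimaryConditionGroup))
    tmp <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df2, FUN=table)
    is.matrix(tmp$PrimaryConditionGroup)   # TRUE
    str(tmp)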
    

    2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.

    library(reshape)
    mdf <- melt(df, id=c("PatientID", "Year"))
    cast(mdf, PatientID + Year ~ value, fun.aggregate=length)
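
    For reference, the newer reshape2 package (used in the answer below) gets to the same place without an explicit melt step; a sketch of the equivalent call, assuming the same df:

    # reshape2 equivalent (sketch): spread PrimaryConditionGroup values into
    # columns, counting rows per PatientID/Year combination.
    library(reshape2)
    dcast(df, PatientID + Year ~ PrimaryConditionGroup, fun.aggregate=length)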
    
  • 2021-02-03 11:36

    data.table versions >= 1.9.0 provide fast melt and dcast methods implemented in C. Here's a comparison with the other excellent solutions from @Josh's answer on 3-million-row data (excluding only base::aggregate, as it was taking quite some time).

    See data.table's NEWS file for more details.
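
    A quick way to check whether the installed version has them (sketch):

    # The C-implemented melt/dcast methods need data.table >= 1.9.0.
    packageVersion("data.table") >= "1.9.0"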

    I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.

    require(data.table) ## >= 1.9.0
    require(reshape2)
    
    set.seed(1L)
    patients = 1000L
    year = 5L
    n = 3e6L
    condn = c("TRAUMA", "PREGNANCY", "SEIZURE")
    
    # dummy data
    DT <- data.table(PatientID = sample(patients, n, TRUE),
                     Year = sample(year, n, TRUE), 
                     PrimaryConditionGroup = sample(condn, n, TRUE))
    
    DT_dcast <- function(DT) {
        # spread PrimaryConditionGroup values into count columns per PatientID/Year
        dcast.data.table(DT, PatientID + Year ~ PrimaryConditionGroup, fun.aggregate=length)
    }
    
    reshape2_dcast <- function(DT) {
        # same cast via reshape2 (a data.table is also a data.frame)
        reshape2:::dcast(DT, PatientID + Year ~ PrimaryConditionGroup, fun.aggregate=length)
    }
    
    DT_raw <- function(DT) {
        DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"),
                PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
                  SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")),
        by = list(PatientID, Year)]
    }
    
    # system.time(.) timed 3 times
    #           Method Time_rep1 Time_rep2 Time_rep3
    #         DT_dcast     0.393     0.399     0.396
    #   reshape2_dcast     3.784     3.457     3.605
    #           DT_raw     0.647     0.680     0.657
    

    dcast.data.table is about 1.6x faster than normal aggregation using data.table and 8.8x faster than reshape2:::dcast.
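
    The timing harness itself isn't shown above; here is a minimal sketch of how those numbers could be reproduced with system.time(), reusing DT and the functions defined earlier:

    # Time each method once; repeat to obtain the three replicates reported above.
    timings <- sapply(list(DT_dcast = DT_dcast,
                           reshape2_dcast = reshape2_dcast,
                           DT_raw = DT_raw),
                      function(f) system.time(f(DT))[["elapsed"]])
    timings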
