Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table

后端 未结 2 1358
难免孤独
难免孤独 2021-01-15 00:55

I\'m trying to enrich one dataset (adherence) based on subsets from another (lsr). For each individual row in adherence, I want to calculate (as a third column) the medicati

相关标签:
2条回答
  • 2021-01-15 01:16

    This can be solved by updating in a non-equi join.

    This avoids the memory issues caused by a cartesian join or by calling apply() which coerces a data.frame or data.table to a matrix which involves copying the data.

    In addition, the OP has mentioned that lsr has a few hundred mio. rows and adherence has 1.5 mio rows (500 timeperiods times 3000 ID's). Therefore, efficient storage of data items will not only reduce the memory footprint but may also reduce the share of processing time which is required for loading data.

    library(data.table)
    # coerce to data.table by reference, i.e., without copying
    setDT(adherence)
    setDT(lsr)
    # coerce to IDate to save memory
    adherence[, year := as.IDate(year)]
    cols <- c("eksd", "ENDDATE")
    lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
    # update in a non-equi join
    adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE), 
                          AH := as.integer(ENDDATE - x.year)][]
    
       ID       year AH
    1:  1 2013-01-01 NA
    2:  2 2013-01-01 NA
    3:  3 2013-01-01 NA
    4:  1 2013-02-01 64
    5:  2 2013-02-01 NA
    6:  3 2013-02-01 63
    

    Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].

    Data

    The code to create the sample datasets can be streamlined:

    adherence <- data.frame(
      ID = c("1", "2", "3", "1", "2", "3"), 
      year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
      stringsAsFactors = FALSE)
    
    lsr <- data.frame(
      ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
      eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
      DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
      stringsAsFactors = FALSE)
    lsr$ENDDATE <- lsr$eksd + lsr$DDD
    

    Note that DDD is of type integer which usually requires 4 bytes instead of 8 bytes for type numeric/double.

    Also note that the last statement may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax which updates by reference.

    library(data.table)
    setDT(lsr)[, ENDDATE := eksd + DDD][]
    
    0 讨论(0)
  • 2021-01-15 01:30

    I am not sure why your function is slow (I think you could remove your ifelse function), but I would propose to use merge to be faster and to operate on one table only:

    plouf <- lsr[adherence, on = "ID", allow.cartesian=TRUE]
    plouf[,year := as.date(year)]
    bob <- rbindlist(lapply(unique(adherence$year),function(x){
      plouf <- lsr[adherence[year == x], on = "ID"]
      plouf[,year := as.Date(year)]
      plouf[year >= eksd & year < ENDDATE,list(sum = sum(as.numeric(ENDDATE-as.Date(year))), year = year), by = ID]
      }))
    bob
    
       ID sum       year
    1:  1  64 2013-02-01
    2:  3  63 2013-02-01
    

    you can then merge to adherence

    adherence <- setDT(adherence)
    adherence[,year := as.Date(year)]
    bob[adherence, on = .(ID,year)]
       ID sum       year
    1:  1  NA 2013-01-01
    2:  2  NA 2013-01-01
    3:  3  NA 2013-01-01
    4:  1  64 2013-02-01
    5:  2  NA 2013-02-01
    6:  3  63 2013-02-01
    

    For reading your data use fread() function that is fast for big data

    0 讨论(0)
提交回复
热议问题