Fastest way to replace NAs in a large data.table

走了就别回头了 2020-11-22 17:10

I have a large data.table, with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.

10 answers
  •  遇见更好的自我
    2020-11-22 17:31

    Here's a solution using data.table's := operator, building on Andrie and Ramnath's answers.
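
    The benchmark below calls create_dt() and remove_na(), which come from the other answers and are not defined here. A minimal sketch of what they might look like, so the code can be run on its own; both function bodies and the prop_na argument name are assumptions, not necessarily the original definitions:

    ```r
    library(data.table)

    # Assumed stand-in for create_dt(): an nrow x ncol table of uniform
    # random numbers in which a fraction prop_na of the cells is set to NA.
    create_dt <- function(nrow = 5, ncol = 5, prop_na = 0.5) {
      v <- runif(nrow * ncol)
      v[sample(seq_len(nrow * ncol), prop_na * nrow * ncol)] <- NA
      as.data.table(matrix(v, ncol = ncol))
    }

    # Assumed stand-in for remove_na(): converts the whole table to a
    # matrix, replaces NA there, and builds a new data.table from it.
    remove_na <- function(x) {
      dm <- data.matrix(x)
      dm[is.na(dm)] <- 0
      as.data.table(dm)
    }

    set.seed(1)
    small   <- create_dt(10, 4, 0.25)   # 40 cells, 10 of them NA
    cleaned <- remove_na(small)         # new table, small is unchanged
    ```

    Note that this version of remove_na() works on a full copy of the data, which matters for the memory discussion further down.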

    require(data.table)  # v1.6.6
    require(gdata)       # v2.8.2
    
    set.seed(1)
    dt1 = create_dt(2e5, 200, 0.1)
    dim(dt1)
    [1] 200000    200    # more columns than Ramnath's answer which had 5 not 200
    
    f_andrie = function(dt) remove_na(dt)
    
    f_gdata = function(dt, un = 0) gdata::NAToUnknown(dt, un)
    
    f_dowle = function(dt) {     # see EDIT later for more elegant solution
      na.replace = function(v,value=0) { v[is.na(v)] = value; v }
      for (i in names(dt))
        eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
    }
    
    system.time(a_gdata <- f_gdata(dt1))   # `<-` so the result is assigned, not passed as an argument name
       user  system elapsed 
     18.805  12.301 134.985 
    
    system.time(a_andrie <- f_andrie(dt1))
    Error: cannot allocate vector of size 305.2 Mb
    Timing stopped at: 14.541 7.764 68.285 
    
    system.time(f_dowle(dt1))
      user  system elapsed 
     7.452   4.144  19.590     # EDIT has faster than this
    
    identical(a_gdata, dt1)   
    [1] TRUE
    

    Note that f_dowle updated dt1 by reference. If a local copy is required, call copy() explicitly to duplicate the whole dataset. data.table's setkey, key<- and := do not copy-on-write.
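
    A small illustration of that reference behaviour, as a sketch on a toy table (the names dt, alias and backup are made up for this example):

    ```r
    library(data.table)

    dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

    alias  <- dt         # plain assignment: both names refer to the same table
    backup <- copy(dt)   # explicit deep copy, independent of dt

    # Replace the NA in column a in place
    dt[is.na(a), a := 0]

    any(is.na(alias$a))    # FALSE: alias sees the in-place update
    any(is.na(backup$a))   # TRUE: the copy still holds the original NA
    ```

    So a benchmark that needs the untouched input for a second run has to take the copy() before calling the in-place function.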

    Next, let's see where f_dowle is spending its time.

    Rprof()
    f_dowle(dt1)
    Rprof(NULL)
    summaryRprof()
    $by.self
                      self.time self.pct total.time total.pct
    "na.replace"           5.10    49.71       6.62     64.52
    "[.data.table"         2.48    24.17       9.86     96.10
    "is.na"                1.52    14.81       1.52     14.81
    "gc"                   0.22     2.14       0.22      2.14
    "unique"               0.14     1.36       0.16      1.56
    ... snip ...
    

    There, I would focus on na.replace and is.na, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small na.replace C function that updates NA by reference in the vector. That would at least halve the 20 seconds I think. Does such a function exist in any R package?
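
    For readers on a much newer data.table than the v1.6.6 used here: versions from 1.12.4 onward ship setnafill(), which fills NA by reference from compiled code and so comes close to what is asked for above. A minimal sketch, assuming such a version is installed and that all columns are numeric (support for other column types should be checked against the installed version's ?nafill):

    ```r
    library(data.table)  # assumes data.table >= 1.12.4 for setnafill()

    dt <- data.table(x = c(1, NA, 3), y = c(NA, 2, NA))

    # Fill every NA with the constant 0, in place, for all columns
    setnafill(dt, type = "const", fill = 0)

    dt$x   # 1 0 3
    dt$y   # 0 2 0
    ```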

    The reason f_andrie fails may be because it copies the whole of dt1, or creates a logical matrix as big as the whole of dt1, a few times. The other 2 methods work on one column at a time (although I only briefly looked at NAToUnknown).
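
    The size in the error message is consistent with that explanation: 305.2 Mb is what one full double-precision copy of a 200000 x 200 table takes. A quick check of the arithmetic (variable names are just for this example):

    ```r
    n_cells <- 2e5 * 200              # number of cells in dt1

    double_mb  <- n_cells * 8 / 1024^2   # 8 bytes per double
    logical_mb <- n_cells * 4 / 1024^2   # 4 bytes per logical

    round(double_mb, 1)    # 305.2, the size in the error message
    round(logical_mb, 1)   # 152.6, one is.na() matrix over the whole table
    ```

    A column-at-a-time method only ever needs temporaries of 200000 elements, about 1.5 Mb for a double vector.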

    EDIT (more elegant solution as requested by Ramnath in comments) :

    f_dowle2 = function(DT) {
      for (i in names(DT))
        DT[is.na(get(i)), (i):=0]
    }
    
    system.time(f_dowle2(dt1))
      user  system elapsed 
     6.468   0.760   7.250   # faster, too
    
    identical(a_gdata, dt1)   
    [1] TRUE
    

    I wish I had done it that way to start with!

    EDIT2 (over 1 year later, now)

    There is also set(). This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling [,:=,] in a loop. set is a loopable :=. See ?set.

    f_dowle3 = function(DT) {
      # either of the following for loops
    
      # by name :
      for (j in names(DT))
        set(DT,which(is.na(DT[[j]])),j,0)
    
      # or by number (slightly faster than by name) :
      for (j in seq_len(ncol(DT)))
        set(DT,which(is.na(DT[[j]])),j,0)
    }
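
    A usage sketch of the set() approach on a toy table, keeping only the by-number loop. The wrapper name replace_na_zero and the invisible(DT) return are additions for this example, not part of f_dowle3:

    ```r
    library(data.table)

    # Hypothetical wrapper around the by-number loop shown above
    replace_na_zero <- function(DT) {
      for (j in seq_len(ncol(DT)))
        set(DT, which(is.na(DT[[j]])), j, 0)   # rows with NA in column j get 0
      invisible(DT)                            # DT itself was modified in place
    }

    dt <- data.table(a = c(1, NA, 3), b = c(NA, NA, 6))
    replace_na_zero(dt)

    dt$a   # 1 0 3
    dt$b   # 0 0 6
    ```

    No assignment of the result is needed, because set() updates the table by reference.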
    
