I have a large data.table, with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.
Here's the simplest one I could come up with:
dt[is.na(dt)] <- 0
It's efficient, and there is no need to write functions or other glue code.
library(data.table)
DT = data.table(a=c(1,"A",NA),b=c(4,NA,"B"))
DT
a b
1: 1 4
2: A NA
3: NA B
DT[,lapply(.SD,function(x){ifelse(is.na(x),0,x)})]
a b
1: 1 4
2: A 0
3: 0 B
Just for reference: this is slower than the gdata or data.matrix approaches, but it uses only the data.table package and can deal with non-numeric entries.
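To make the point about non-numeric entries concrete, here is a small sketch of what ifelse() produces for the character columns in the example above. The numeric 0 is coerced to the string "0", so the columns stay character. The name res is only illustrative.

```r
library(data.table)

# Mixing numbers and strings in c() makes both columns character
DT <- data.table(a = c(1, "A", NA), b = c(4, NA, "B"))
res <- DT[, lapply(.SD, function(x) ifelse(is.na(x), 0, x))]

# The replacement value is coerced to the column type
sapply(res, class)
res$a
```

If the columns need to stay numeric, this coercion is worth keeping in mind before applying the same call to mixed data.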
For the sake of completeness, another way to replace NAs with 0 is to use:
f_rep <- function(dt) {
dt[is.na(dt)] <- 0
return(dt)
}
To compare results and times I have incorporated all approaches mentioned so far.
set.seed(1)
dt1 <- create_dt(2e5, 200, 0.1)
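# copy() gives independent tables; plain assignment would share the data with dt1,
# so functions that update by reference would change dt1 as well
dt2 <- copy(dt1)
dt3 <- copy(dt1)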
system.time(res1 <- f_gdata(dt1))
user system elapsed
3.62 0.22 3.84
system.time(res2 <- f_andrie(dt1))
user system elapsed
2.95 0.33 3.28
system.time(f_dowle2(dt2))
user system elapsed
0.78 0.00 0.78
system.time(f_dowle3(dt3))
user system elapsed
0.17 0.00 0.17
system.time(res3 <- f_unknown(dt1))
user system elapsed
6.71 0.84 7.55
system.time(res4 <- f_rep(dt1))
user system elapsed
0.32 0.00 0.32
identical(res1, res2) & identical(res2, res3) & identical(res3, res4) & identical(res4, dt2) & identical(dt2, dt3)
[1] TRUE
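The by-reference functions timed above are not defined in this excerpt. As a rough sketch of that style (an assumption about how such a function might look, not the definition that was benchmarked), a loop over columns using data.table's set() could be written as follows. The name fill_na_inplace is made up for illustration.

```r
library(data.table)

# Hypothetical helper: replace NA with 0 column by column, modifying DT in place
fill_na_inplace <- function(DT) {
  for (j in seq_along(DT)) {
    # which() yields the row indices of the NAs in column j (possibly none)
    set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
  }
  invisible(DT)
}

DT <- data.table(x = c(1, NA, 3), y = c(NA, 5, NA))
fill_na_inplace(DT)   # no assignment needed: DT itself is updated
DT
```

Because set() updates the table in place, the call is timed without assigning a result, which matches how the by-reference functions are called in the benchmark.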
So the new approach is slightly slower than f_dowle3 but faster than all the other approaches. But to be honest, this is against my intuition of the data.table syntax and I have no idea why this works. Can anybody enlighten me?
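One way to look at it, offered as a sketch rather than a full account of data.table's internals: is.na() applied to a whole table returns a logical matrix with one cell per entry, not a data.table. The assignment is therefore a matrix-indexed replacement, the same form that works on a plain data.frame, rather than the usual i/j syntax. The names m and filled below are only illustrative.

```r
library(data.table)

dt <- data.table(x = c(1, NA, 3), y = c(NA, 5, 6))

m <- is.na(dt)
is.matrix(m)     # a logical matrix, same shape as dt
dim(m)

filled <- dt
filled[m] <- 0   # replace every cell flagged TRUE in the matrix
filled
```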
My understanding is that the secret to fast operations in R is to utilise vectors (or arrays, which are vectors under the hood).
In this solution I make use of a data.matrix, which is an array but behaves a bit like a data.frame. Because it is an array, you can use a very simple vector substitution to replace the NAs:
A little helper function to remove the NAs. The essence is a single line of code. I only do this to measure execution time.
remove_na <- function(x){
dm <- data.matrix(x)
dm[is.na(dm)] <- 0
data.table(dm)
}
A little helper function to create a data.table of a given size.
create_dt <- function(nrow=5, ncol=5, propNA = 0.5){
v <- runif(nrow * ncol)
v[sample(seq_len(nrow*ncol), propNA * nrow*ncol)] <- NA
data.table(matrix(v, ncol=ncol))
}
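As a quick sanity check on the helper, the sketch below repeats its definition so that it runs on its own, then confirms the size of the table and the number of NA cells. The dimensions and the 0.5 proportion are arbitrary choices for illustration.

```r
library(data.table)

# Same helper as above, repeated so this snippet is self-contained
create_dt <- function(nrow = 5, ncol = 5, propNA = 0.5) {
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow * ncol), propNA * nrow * ncol)] <- NA
  data.table(matrix(v, ncol = ncol))
}

set.seed(1)
dt <- create_dt(100, 10, 0.5)
dim(dt)            # rows and columns
sum(is.na(dt))     # number of NA cells
mean(is.na(dt))    # share of NA cells
```

Since sample() draws positions without replacement, the share of NA cells matches propNA exactly whenever propNA * nrow * ncol is a whole number.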
Demonstration on a tiny sample:
library(data.table)
set.seed(1)
dt <- create_dt(5, 5, 0.5)
dt
V1 V2 V3 V4 V5
[1,] NA 0.8983897 NA 0.4976992 0.9347052
[2,] 0.3721239 0.9446753 NA 0.7176185 0.2121425
[3,] 0.5728534 NA 0.6870228 0.9919061 NA
[4,] NA NA NA NA 0.1255551
[5,] 0.2016819 NA 0.7698414 NA NA
remove_na(dt)
V1 V2 V3 V4 V5
[1,] 0.0000000 0.8983897 0.0000000 0.4976992 0.9347052
[2,] 0.3721239 0.9446753 0.0000000 0.7176185 0.2121425
[3,] 0.5728534 0.0000000 0.6870228 0.9919061 0.0000000
[4,] 0.0000000 0.0000000 0.0000000 0.0000000 0.1255551
[5,] 0.2016819 0.0000000 0.7698414 0.0000000 0.0000000