Replacing all missing values in R data.table with a value

清歌不尽 2020-12-04 14:05

If you have an R data.table with missing values, how do you replace all of them with, say, the value 0? E.g.

aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,         


        
4 Answers
  • 2020-12-04 14:20

    The specific problem the OP posted can also be solved with:

    tt[is.na(X), X := 0]
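
    As a minimal sketch of that one-liner: the question's own table definition is truncated, so the tt below is a hypothetical reconstruction based on the output printed in a later answer. The string "0" is used here only so the replacement matches the character type of X in that reconstruction.

    ```r
    library(data.table)

    # Hypothetical reconstruction of tt (the original definition is cut off)
    tt <- data.table(
      V1 = 1:10,
      X  = c(NA, NA, "a", "b", "c", "d", NA, NA, NA, NA),
      V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
    )

    # Update by reference: only the rows where X is NA are touched
    tt[is.na(X), X := "0"]

    print(tt)
    ```

    This only covers the single column X; the other answers deal with replacing NA across every column.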
    
  • 2020-12-04 14:29

    Nothing unusual here:

    tt[is.na(tt)] = 0
    

    ...will work.

    This is somewhat confusing, however, given that:

    tt[is.na(tt)]
    

    ...currently returns:

    Error in [.data.table(tt, is.na(tt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.
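
    If the goal is just to see which cells are missing without running into that error, one option (a sketch, not something the answer itself proposes) is to pass the logical matrix to base R's which(). The tt below is a hypothetical reconstruction, since the question's definition is truncated.

    ```r
    library(data.table)

    # Hypothetical reconstruction of tt (the original definition is cut off)
    tt <- data.table(
      V1 = 1:10,
      X  = c(NA, NA, "a", "b", "c", "d", NA, NA, NA, NA),
      V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
    )

    # is.na() on a data.table returns a logical matrix; arr.ind = TRUE
    # turns each TRUE cell into a (row, col) index pair
    na_cells <- which(is.na(tt), arr.ind = TRUE)

    print(na_cells)
    ```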

  • 2020-12-04 14:30

    I would make use of data.table and lapply, namely:

    tt[,lapply(.SD,function(kkk) ifelse(is.na(kkk),-666,kkk)),.SDcols=names(tt)]
    

    yielding:

        V1    X V2
     1:  1 -666  1
     2:  2 -666  2
     3:  3    a  2
     4:  4    b  3
     5:  5    c  3
     6:  6    d  3
     7:  7 -666  4
     8:  8 -666  4
     9:  9 -666  4
    10: 10 -666  4
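
    Note that the call above returns a new data.table and leaves tt unchanged. A possible variant, sketched here under the assumption that tt matches the printed output (the variable cols is introduced only for illustration), writes the result back into tt with := instead:

    ```r
    library(data.table)

    # Hypothetical reconstruction of tt, matching the output printed above
    tt <- data.table(
      V1 = 1:10,
      X  = c(NA, NA, "a", "b", "c", "d", NA, NA, NA, NA),
      V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
    )

    cols <- names(tt)

    # Replace every column in tt with its NA-free version
    tt[, (cols) := lapply(.SD, function(col) ifelse(is.na(col), -666, col)),
       .SDcols = cols]

    print(tt)
    ```

    Because ifelse() builds a whole new vector for every column, this is convenient rather than fast.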
    
  • 2020-12-04 14:38

    is.na (being a primitive) has very little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.

    Using <- to assign will copy all the columns, and it is not the idiomatic way to do this with data.table.

    First I'll illustrate how to do it, and then show how slow the <- approach can get on huge data (due to the copy):

    One way to do this efficiently:

    for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
    

    You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.
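
    If you would rather not rely on that coercion, a variant of the loop can pick a replacement that matches each column's type. This is only a sketch: tt is a hypothetical reconstruction of the question's table, and only character, integer and double columns are handled.

    ```r
    library(data.table)

    # Hypothetical reconstruction of tt (the original definition is cut off)
    tt <- data.table(
      V1 = 1:10,
      X  = c(NA, NA, "a", "b", "c", "d", NA, NA, NA, NA),
      V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
    )

    for (j in seq_along(tt)) {
      na_rows <- which(is.na(tt[[j]]))
      if (length(na_rows) == 0L) next  # nothing to replace in this column

      # Choose a replacement of the same type as the column,
      # so set() has nothing to coerce
      value <- if (is.character(tt[[j]])) {
        "0"
      } else if (is.integer(tt[[j]])) {
        0L
      } else {
        0
      }

      # Assign by reference, touching only the NA rows of column j
      set(tt, i = na_rows, j = j, value = value)
    }

    print(tt)
    ```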

    Why you shouldn't use <- here:

    # by reference - idiomatic way
    set.seed(45)
    tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
    tracemem(tt)
    # modifies value by reference - no copy
    system.time({
    for (i in seq_along(tt)) 
        set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
    })
    #   user  system elapsed 
    #  0.284   0.083   0.386 
    
    # by copy - NOT the idiomatic way
    set.seed(45)
    tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
    tracemem(tt)
    # makes copy
    system.time({tt[is.na(tt)] <- 0})
    # a bunch of "tracemem" output showing the copies being made
    #   user  system elapsed 
    #  4.110   0.976   5.187 
    