How should I drop blocks of NAs from an R data.table?

后端未结

I have a large R data.table with a multi-column key, where some value columns contain NAs. I'd like to remove groups that are entirely NA in one or more value

2 Answers

生来不讨喜

2021-02-05 20:09

You can do this to keep only the groups in which not ALL of Value is NA:

setkey(DT, "Series")
DT[, .SD[(!all(is.na(Value)))], by=Series]

The parentheses around !all are needed so that the leading ! is not parsed as data.table's not-join syntax, which Matthew will look into (see comments). This is equivalent to:

DT[, .SD[as.logical(!all(is.na(Value)))], by=Series]
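As a minimal sketch of the single-column case, assuming a small made-up table (the data and the names kept and kept_if are illustrative, not from the question), the parenthesised .SD filter and an if-based alternative that returns nothing for all-NA groups could look like this:

```r
library(data.table)

# Toy data (assumed): Series "a" is entirely NA, "b" is partly NA, "c" has no NA
DT <- data.table(
  Series = c("a", "a", "b", "b", "c", "c"),
  Value  = c(NA, NA, 1, NA, 2, 3)
)

# Form from the answer: the extra parentheses keep `!` from being read as a not-join
kept <- DT[, .SD[(!all(is.na(Value)))], by = Series]

# Alternative form: return .SD only when the group is not all NA;
# a group whose j evaluates to NULL contributes no rows
kept_if <- DT[, if (!all(is.na(Value))) .SD, by = Series]

print(kept)
```

Both forms drop Series "a" as a whole and keep the partly-NA group "b" intact, including its NA row.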

Building on that to answer the newly clarified question:

allNA = function(x) all(is.na(x))     # define helper function
for (i in c("Id","Series"))
    DT = DT[, if (!any(sapply(.SD,allNA))) .SD else NULL, by=i]
DT
    Series Id Value1 Value2
 1:      i  1      1      1
 2:      i  2      2      2
 3:      i  3      3      3
 4:      b  5      5      2
 5:      b  6      6      3
 6:      f  5      5      2
 7:      f  6      6      3
 8:      j  5      5      5
 9:      j  6      6      6
10:      c  7      7      4
11:      c  8      8     NA
12:      c  9      9      6
13:      g  7      7      4
14:      g  8      8      5
15:      g  9      9      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9
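The same loop can be tried on a much smaller table. This is only a sketch under assumed data: the six rows below and the loop variable grp are made up for illustration and are not the DT from the question.

```r
library(data.table)

# Toy data (assumed): Series "a" is all NA in Value1; no Id group is all NA
DT <- data.table(
  Series = c("a", "a", "b", "b", "c", "c"),
  Id     = c(1, 2, 1, 2, 3, 4),
  Value1 = c(NA, NA, 1, 2, 3, 4),
  Value2 = c(5, 6, 7, NA, 8, 9)
)

allNA <- function(x) all(is.na(x))   # TRUE when a column is entirely NA

# Filter once per key column; a group is dropped if any of its columns is all NA
for (grp in c("Id", "Series"))
  DT <- DT[, if (!any(sapply(.SD, allNA))) .SD else NULL, by = grp]

print(DT)
```

Each pass regroups the table by the current key column, so the rows come back gathered by group and the grouping column moves to the front.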

That changes the row order, though, so it isn't precisely the result requested. The following keeps the order and should be faster too.

# starting fresh from original DT in question again
DT[,drop:=FALSE]
for (i in c("Series","Id"))
    DT[,drop:=drop|any(sapply(.SD,allNA)),by=i]
DT[(!drop)][,drop:=NULL][]
    Series Id Value1 Value2
 1:      b  5      5      2
 2:      b  6      6      3
 3:      c  7      7      4
 4:      c  8      8     NA
 5:      c  9      9      6
 6:      f  5      5      2
 7:      f  6      6      3
 8:      g  7      7      4
 9:      g  8      8      5
10:      g  9      9      6
11:      i  1      1      1
12:      i  2      2      2
13:      i  3      3      3
14:      j  5      5      5
15:      j  6      6      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9
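To see the order-preserving behaviour in isolation, here is a small sketch of the flag-column approach on made-up, deliberately unsorted data. The column name to_drop and the variables grp and res are hypothetical choices for this example; the answer itself names the flag column drop.

```r
library(data.table)

# Toy data (assumed), not sorted by Series: "a" is all NA in Value1
DT <- data.table(
  Series = c("c", "b", "a", "b", "a", "c"),
  Id     = c(3, 1, 1, 2, 2, 4),
  Value1 = c(3, 1, NA, 2, NA, 4),
  Value2 = c(8, 7, 5, NA, 6, 9)
)

allNA <- function(x) all(is.na(x))   # TRUE when a column is entirely NA

# Mark rows by reference instead of rebuilding the table for each key column
DT[, to_drop := FALSE]
for (grp in c("Series", "Id"))
  DT[, to_drop := to_drop | any(sapply(.SD, allNA)), by = grp]

# Keep unmarked rows, then remove the helper column from the result
res <- DT[(!to_drop)][, to_drop := NULL][]

print(res)
```

Because the rows are only flagged and then subset once, the surviving rows stay in their original relative order (here c, b, b, c).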

花落未央

2021-02-05 20:25

What about using the complete.cases function?

DT[complete.cases(DT),]

It drops every row that has an NA in any column (it works row by row, not by group).

> DT[complete.cases(DT),]
    Series Id Value1 Value2
 1:      b  4      4      1
 2:      b  5      5      2
 3:      b  6      6      3
 4:      c  7      7      4
 5:      c  8      8      5
 6:      c  9      9      6
 7:      f  4      4      1
 8:      f  5      5      2
 9:      f  6      6      3
10:      g  7      7      4
11:      g  8      8      5
12:      g  9      9      6
13:      j  4      4      1
14:      j  5      5      2
15:      j  6      6      3
16:      k  7      7      4
17:      k  8      8      5
18:      k  9      9      6
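For comparison, a short sketch of how row-wise complete.cases differs from a group-wise filter, assuming a tiny made-up table (the data and the names rowwise and groupwise are illustrative only):

```r
library(data.table)

# Toy data (assumed): Series "a" is entirely NA, Series "b" is only partly NA
DT <- data.table(
  Series = c("a", "a", "b", "b"),
  Value  = c(NA, NA, 1, NA)
)

# Row-wise: removes every row that contains an NA in any column
rowwise <- DT[complete.cases(DT)]

# Group-wise: removes only groups whose Value is NA in every row
groupwise <- DT[, if (!all(is.na(Value))) .SD, by = Series]

print(rowwise)
print(groupwise)
```

The row-wise result loses the NA row of Series "b" as well, while the group-wise result keeps Series "b" whole and drops only Series "a".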
