Unable to subset (filter) a data frame due to NA's

Submitted by 淺唱寂寞╮ on 2021-02-08 19:06:34

Question


Why doesn't dplyr's filter return the same data.frame as base R subsetting in the code below?

In fact, neither of them works as expected. I'd like to remove the observations/rows for which b==1 AND c==1 hold simultaneously. That is, I'd like to remove only the third row.

require(dplyr)
df <- data.frame(a=c(0,0,0,0,1,1,1),
  b=c(0,0,1,1,0,0,1),
  c=c(1,NA,1,NA,1,NA,NA))

filter(df, !(b==1 & c==1))

df[!(df$b==1 & df$c==1),]

Answer 1:


You can use complete.cases to force the condition to FALSE for every row where b or c is NA, so that those rows are kept after the negation. This relies on the fact that NA & FALSE is FALSE:

filter(df, !(b == 1 & c == 1 & complete.cases(df[c('b', 'c')])))

#   a b  c
# 1 0 0  1
# 2 0 0 NA
# 3 0 1 NA
# 4 1 0  1
# 5 1 0 NA
# 6 1 1 NA

Logical operations involving NA can look confusing at first glance, but they are consistent: the result is NA only when the missing value could change the outcome:

NA & FALSE
# [1] FALSE
NA | TRUE
# [1] TRUE
NA & TRUE
# [1] NA
NA | FALSE
# [1] NA
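
To see why the two calls in the question disagree, it can help to look at the raw condition vector. The sketch below uses base R only and the question's data; the names cond, base_result and true_only are introduced here purely for illustration. Selecting with which() stands in for dplyr's filter, which keeps only rows where the condition is TRUE and drops rows where it is NA.

```r
# Same data as in the question
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

# The raw condition is NA wherever b == 1 and c is missing (rows 4 and 7)
cond <- !(df$b == 1 & df$c == 1)
cond
# [1]  TRUE  TRUE FALSE    NA  TRUE  TRUE    NA

# Base `[` turns each NA index into an all-NA row: 6 rows, 2 of them all NA
base_result <- df[cond, ]
nrow(base_result)
# [1] 6

# Keeping only the rows where cond is TRUE leaves 4 rows
true_only <- df[which(cond), ]
nrow(true_only)
# [1] 4
```

Neither result is the desired six real rows, which is why the condition has to be made NA-free before subsetting.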



Answer 2:


This is the simplest option I can think of:

filter(df, !((b==1 & c==1) %in% TRUE))
#  a b  c
#1 0 0  1
#2 0 0 NA
#3 0 1 NA
#4 1 0  1
#5 1 0 NA
#6 1 1 NA

# or equivalently in data.table (assuming dt <- as.data.table(df))
dt[!((b==1 & c==1) %in% TRUE)]

Another option, more verbose but perhaps clearer, is to use !(b==1 & c==1) | is.na(b+c) as the condition.
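
As a rough check that these conditions agree on the question's data, the sketch below builds each one as a plain logical vector in base R and compares them. The names keep_in, keep_na and keep_cc are made up for this illustration; keep_cc is the complete.cases variant from Answer 1.

```r
# Same data as in the question
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

# %in% TRUE maps NA to FALSE before the negation
keep_in <- with(df, !((b == 1 & c == 1) %in% TRUE))

# explicit rescue of rows where b or c is missing
keep_na <- with(df, !(b == 1 & c == 1) | is.na(b + c))

# complete.cases forces the inner condition to FALSE for incomplete rows
keep_cc <- with(df, !(b == 1 & c == 1 & complete.cases(b, c)))

keep_in
# [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
identical(keep_in, keep_na)
# [1] TRUE
identical(keep_in, keep_cc)
# [1] TRUE
```

Because none of the three vectors contains NA, df[keep_in, ] and filter(df, keep_in) would select the same six rows.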




Answer 3:


Using data.table

library(data.table)
setDT(df)[!(b == 1 & c == 1 & complete.cases(b, c))]
#   a b  c
#1: 0 0  1
#2: 0 0 NA
#3: 0 1 NA
#4: 1 0  1
#5: 1 0 NA
#6: 1 1 NA



Answer 4:


Yes, the NA values cause problems. Here are four workarounds:

Method 1: 2-step Exclusion

n <- (df$b + df$c == 2)
# keep rows where the test is FALSE or NA
df[n %in% c(NA, FALSE), ]
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA

Method 2: Conditional Sum

df[!(complete.cases(df$b,df$c) & df$b+df$c == 2),]
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA
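
One caveat that the answer does not state: the sum test in Methods 1 and 2 stands in for b==1 & c==1 only because both columns hold 0/1 values here. The sketch below uses a small made-up data frame x (hypothetical, not from the question) to show where the two tests diverge.

```r
# Hypothetical data where b is not restricted to 0/1
x <- data.frame(b = c(1, 2, 1), c = c(1, 0, NA))

sum_match <- x$b + x$c == 2        # sum test used in Methods 1 and 2
cmp_match <- x$b == 1 & x$c == 1   # the condition from the question

sum_match
# [1] TRUE TRUE   NA
cmp_match
# [1]  TRUE FALSE    NA
```

The second row (b = 2, c = 0) passes the sum test but not the direct comparison, so for general numeric columns the explicit comparison is the safer choice.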

Method 3: Loop/Function

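filterwithNA <- function(df, n) {
  # mark rows to drop first; deleting inside the loop would shift the indices
  drop <- logical(nrow(df))
  for (i in seq_len(nrow(df))) {
    # only compare when both values are present
    if (!is.na(df$b[i]) && !is.na(df$c[i])) {
      drop[i] <- df$b[i] == n && df$c[i] == n
    }
  }
  df[!drop, ]
}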

filterwithNA(df, n=1)
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA

Method 4: Temporary numeric replacement

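# assumes the placeholder 999 never occurs in the real data
df[is.na(df)] <- 999

# assign the result, otherwise the filtered rows are discarded
df <- df[!(df$b==1 & df$c==1),]
df[df==999] <- NA
df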
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA


Source: https://stackoverflow.com/questions/38948196/unable-to-subset-filter-a-data-frame-due-to-nas
