When subsetting rows by a factor with ==, NAs are also included. This doesn't happen with %in%. Is that normal?

Submitted by ◇◆丶佛笑我妖孽 on 2020-01-03 11:49:04

Question


Suppose I have a factor A with 3 levels A1, A2, A3 and with NA's. Each appears in 10 cases, so there is a total of 40 cases. If I do

subset1 <- df[df$A=="A1",]  
dim(subset1)  # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),] 
dim(subset2)  # 10, as expected
summary(subset2$A) # only A1 has non-zero count

The result is the same whether the variable used for subsetting is a factor or an integer. Is this just how == (and >, <) works? Should I stick to %in% for factors and always add !is.na() when using ==? Thanks!


Answer 1:


Yes, this is expected. == and %in% return different values when they encounter NA, because of how "%in%" is defined...

# Data...
x <- c("A",NA,"A")

# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE   NA TRUE
x[x=="A"]
#[1] "A" NA  "A"

# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1]  TRUE FALSE  TRUE
x[ x %in% "A" ]
#[1] "A" "A"

This is because (from the docs)...

%in% is a binary-operator interface to match, and is defined as

"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0

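If we redefine it using match's default nomatch = NA_integer_, it returns the same result as == for this vector (note that, unlike ==, it would also return NA for any non-missing value that is absent from table):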

"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE   NA TRUE
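
The same effect can be reproduced on a data frame shaped like the one in the question. This is only a sketch: the question does not show how df was built, so the construction below (including the extra id column) is an assumption made for illustration.

```r
# Hypothetical data: factor A with levels A1, A2, A3 plus NA, 10 cases each.
# The id column is an assumption, added so the data frame has two columns.
df <- data.frame(
  id = 1:40,
  A  = factor(rep(c("A1", "A2", "A3", NA), each = 10))
)

# == returns NA for the missing entries, and an NA row index yields an all-NA row
n_eq <- nrow(df[df$A == "A1", ])

# %in% maps NA to FALSE, so only the real A1 rows are kept
n_in <- nrow(df[df$A %in% "A1", ])

# An explicit NA guard gives the same rows as %in%
n_guard <- nrow(df[!is.na(df$A) & df$A == "A1", ])

# subset() treats missing values in the condition as FALSE
n_subset <- nrow(subset(df, A == "A1"))

c(n_eq, n_in, n_guard, n_subset)
#[1] 20 10 10 10
```

So for the question's case, %in%, an explicit !is.na() guard, and subset() should all give the 10 expected rows; only the bare == index brings in the 10 NA rows.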



Answer 2:


There's a mismatch here between what you want (only the entries that match your filtering) and what R does.

When a logical index vector contains an NA, R still returns an element for that position, but its value is NA. Comparisons such as == and < return NA for missing inputs, and that is where the unwanted entries come from.

Consider these cases:

x <- 1:10
y <- x
y[4] <- NA
ix1 <- which(x < 5)
ix2 <- which(y < 5)
x[ix1]
y[ix2]

Versus:

x[x < 5]
y[y < 5]

And

y < 5

It is because of this behavior that I almost never use v[logicalCondition] and instead add an additional command to select the entries, e.g. ixSelect <- which(logicalCondition). If you want NAs, you can use which(logicalCondition | is.na(v)).
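
For reference, here is a short sketch of what those calls return, using the same y as above (1:10 with the fourth value set to NA). The variable names r_plain, r_which and r_with_na are only illustrative.

```r
y <- c(1:3, NA, 5:10)                    # same as y <- 1:10; y[4] <- NA

y < 5                                    # the 4th element of the test is NA
#[1]  TRUE  TRUE  TRUE    NA FALSE FALSE FALSE FALSE FALSE FALSE

r_plain   <- y[y < 5]                    # the NA index produces an NA element
r_which   <- y[which(y < 5)]             # which() keeps only the TRUE positions
r_with_na <- y[which(y < 5 | is.na(y))]  # NAs requested explicitly

r_plain
#[1]  1  2  3 NA
r_which
#[1] 1 2 3
r_with_na
#[1]  1  2  3 NA
```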



Source: https://stackoverflow.com/questions/24040758/when-subsetting-rows-with-a-factor-with-equal-nas-are-also-included-it-d
