Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter


I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also c


2 Answers

温柔的废话

2021-01-19 10:35
One workaround is to use %in%:
```
subset(df, !y %in% "a")
dplyr::filter(df, !y %in% "a")
```
暗喜

2021-01-19 10:36
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
```
> df[df$y != 'a',]
    x    y
NA NA <NA>
3   3    c
```
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA is intended to mean "unknown", so any comparison against it is also unknown, and df$y != 'a' returns:
```
> df$y != 'a'
[1] FALSE    NA  TRUE
```
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
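To make the mechanics concrete, here is a small sketch. The data frame is an assumption reconstructed from the printed output above (x = 1, 2, 3 and y = "a", NA, "c"); the question's own definition of df isn't shown here.

```r
# Assumed data, reconstructed from the printed output above
df <- data.frame(x = 1:3, y = c("a", NA, "c"), stringsAsFactors = FALSE)

# Comparing NA with anything gives NA, so the condition has three states
keep <- df$y != "a"

# Indexing with an NA in the logical vector yields a row filled with NAs
bracket <- df[keep, ]

# subset() treats an NA condition as FALSE and drops that row instead
sub <- subset(df, y != "a")

print(keep)
print(bracket)
print(sub)
```

Note that the NA row from bracket indexing is not the original second row: its x value is NA as well, so the data in that row is lost either way.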

Many people dislike this behavior, but it is what it is.

subset and dplyr::filter make a different default choice: they treat an NA condition as FALSE and simply drop those rows, which is arguably the more defensible result.

But really, the lesson here is that if your data has NAs, you need to code defensively around them at every step, either with conditions like is.na(df$y) | df$y != 'a' or, as mentioned in the other answer, with %in%, which is based on match and so never returns NA.
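As a rough sketch of both defensive options, again assuming the same reconstructed df (x = 1, 2, 3 and y = "a", NA, "c"), which is my guess at the question's data rather than its actual definition:

```r
# Assumed data, reconstructed from the printed output in this answer
df <- data.frame(x = 1:3, y = c("a", NA, "c"), stringsAsFactors = FALSE)

# Option 1: explicitly keep rows where y is NA or differs from "a"
explicit <- df[is.na(df$y) | df$y != "a", ]

# Option 2: %in% returns FALSE (never NA) for NA, so its negation keeps the NA row
via_in <- df[!df$y %in% "a", ]

# The explicit condition also works inside subset()
via_subset <- subset(df, is.na(y) | y != "a")

print(explicit)
```

All three keep rows 2 and 3 with their original x values intact. The same conditions can be passed to dplyr::filter, as the other answer shows for %in%.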