How to subset data in R without losing NA rows?


無奈伤痛

I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.

I am looking to subset my data-frame so that all He

3 Answers

死守一世寂寞

2020-11-29 11:17
You could also do:
```
df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
```
陌清茗

2020-11-29 11:24
If you use the subset function, be aware of this note in its documentation (?subset):
```
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
```
So elements for which the condition evaluates to NA are dropped; only those where it is TRUE are retained.
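A minimal sketch of that behaviour, using hypothetical toy data (the same values as the example further down, not the asker's actual data frame):

```r
# Hypothetical toy data: Height has NA in rows 1 and 4
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

# Height < 40 is NA for rows 1 and 4, so subset() silently drops them
dropped <- subset(df1, Height < 40)
dropped$y
# [1] 2 3
```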

If you want to keep the NA cases, add a logical "or" condition so that R does not drop them:
```
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
```
Don't use this directly (the reason is explained below):
```
df2 <- df1[df1$Height < 40, ]
```
Example:
```
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

subset(df1, Height < 40 | is.na(Height))

#  Height y
#1     NA 1
#2      2 2
#3      4 3
#4     NA 4

df1[df1$Height < 40, ]

#     Height  y
#NA       NA NA
#2         2  2
#3         4  3
#NA.1     NA NA
```
The latter fails because indexing by NA gives NA: each NA in the logical index produces a row filled with NA. Consider this simple example with a vector:
```
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA  2 NA
```
We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):
```
x[ind | is.na(ind)]
# [1] 1 2 3
```
This is exactly what happens in your situation. If Height contains NA, then the logical operation Height < 40 yields a mix of TRUE / FALSE / NA, so we need to replace NA with TRUE as above.
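Applied to a data frame, the intermediate logical vectors look roughly like this. It is only a sketch on hypothetical toy data; the names cond, keep, and kept are introduced here for illustration:

```r
# Hypothetical toy data: Height has NA in rows 1 and 4
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

# The comparison propagates NA
cond <- df1$Height < 40
cond
# [1]    NA  TRUE  TRUE    NA FALSE FALSE

# NA | TRUE is TRUE, so the "or" turns every NA into TRUE
keep <- cond | is.na(cond)
keep
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Rows with NA Height are kept intact
kept <- df1[keep, ]
kept$y
# [1] 1 2 3 4
```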

無奈伤痛

2020-11-29 11:27

For subsetting by character/factor variables, you can use %in% to keep NAs. Specify the data you wish to exclude.

```
# Create dataset
library(data.table)
df <- data.table(V1 = c('Surface', 'Bottom', NA), V2 = 1:3)
df
#         V1 V2
# 1: Surface  1
# 2:  Bottom  2
# 3:    <NA>  3

# Keep all but 'Bottom'
df[!V1 %in% c('Bottom')]
#         V1 V2
# 1: Surface  1
# 2:    <NA>  3
```

This works because %in% never returns NA (see ?match).
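The difference from != can be seen on a plain character vector in base R. This is a small sketch; the names v, ne, and nin are made up for illustration:

```r
# Hypothetical character vector with one missing value
v <- c("Surface", "Bottom", NA)

# != propagates NA, so the NA element would be lost or mangled when indexing
ne <- v != "Bottom"
ne
# [1]  TRUE FALSE    NA

# %in% always returns TRUE or FALSE, so negating it keeps the NA element
nin <- !(v %in% "Bottom")
nin
# [1]  TRUE FALSE  TRUE

v[nin]
# [1] "Surface" NA
```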
