How to subset data in R without losing NA rows?

無奈伤痛 2020-11-29 10:38

I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.

I am looking to subset my data-frame so that all He

3 Answers
  • 2020-11-29 11:17

    You could also do:

    df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
    
  • 2020-11-29 11:24

    If you use the subset function, be aware of how it treats NA. From ?subset:

    For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
    

    So elements for which the condition evaluates to NA are dropped; for a data frame, that means the rows where Height is NA.
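
    As a quick illustration of that dropping behaviour, here is a minimal sketch on a small made-up data frame. The name `d`, the `id` column and the values are hypothetical, not taken from the question:

    ```r
    # Hypothetical data: two of the five heights are missing
    d <- data.frame(Height = c(NA, 10, 55, NA, 30), id = 1:5)

    # Height < 40 is NA for rows 1 and 4, and subset() treats NA as FALSE
    kept <- subset(d, Height < 40)

    kept$id
    # [1] 2 5
    nrow(kept)
    # [1] 2
    ```

    Both NA rows disappear even though nothing in the condition asked for them to be removed.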

    If you want to keep the NA cases, add a logical "or" condition so that R does not drop them:

    subset(df1, Height < 40 | is.na(Height))
    # or `df1[df1$Height < 40 | is.na(df1$Height), ]`
    

    Don't index directly with the bare condition (the reason is explained below):

    df2 <- df1[df1$Height < 40, ]
    

    Example

    df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
    
    subset(df1, Height < 40 | is.na(Height))
    
    #  Height y
    #1     NA 1
    #2      2 2
    #3      4 3
    #4     NA 4
    
    df1[df1$Height < 40, ]
    
    #      Height  y
    # NA       NA NA
    # 2         2  2
    # 3         4  3
    # NA.1     NA NA
    

    The latter fails because indexing by NA gives NA. Consider this simple example with a vector:

    x <- 1:4
    ind <- c(NA, TRUE, NA, FALSE)
    x[ind]
    # [1] NA  2 NA
    

    We need to replace those NA with TRUE. The most straightforward way is to add another "or" condition, is.na(ind):

    x[ind | is.na(ind)]
    # [1] 1 2 3
    

    This is exactly what happens in your situation. If Height contains NA, the logical operation Height < 40 yields a mix of TRUE / FALSE / NA, so we need to replace NA with TRUE as above.
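
    To make that TRUE / FALSE / NA mix concrete, the sketch below builds the logical index step by step. It is only an illustration: `Height` here is a standalone vector with made-up values, and `cond` and `keep` are names introduced for this example.

    ```r
    # Hypothetical heights, two of them missing
    Height <- c(NA, 2, 4, NA, 50, 60)

    # Comparing against NA yields NA, not FALSE
    cond <- Height < 40
    cond
    # [1]    NA  TRUE  TRUE    NA FALSE FALSE

    # NA | TRUE is TRUE, so the missing positions become TRUE
    keep <- cond | is.na(Height)
    keep
    # [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

    Height[keep]
    # [1] NA  2  4 NA
    ```

    Once the index contains no NA, bracket indexing returns the original elements (including the missing ones) instead of manufacturing all-NA entries.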

  • 2020-11-29 11:27

    For subsetting by character/factor variables, you can use %in% to keep NAs: specify the values you wish to exclude and negate the match.

    # Create Dataset
    library(data.table)
    df <- data.table(V1 = c('Surface', 'Bottom', NA), V2 = 1:3)
    df
    #         V1 V2
    # 1: Surface  1
    # 2:  Bottom  2
    # 3:    <NA>  3
    
    # Keep all but 'Bottom'
    df[!V1 %in% c('Bottom')]
    #         V1 V2
    # 1: Surface  1
    # 2:    <NA>  3
    

    This works because %in% never returns NA (see ?match).
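
    The difference from an ordinary comparison can be seen on a plain character vector, without data.table. This is a small sketch; `v`, `by_compare` and `by_match` are names made up for the illustration:

    ```r
    # Hypothetical values, one of them missing
    v <- c("Surface", "Bottom", NA)

    # != propagates NA, so the missing element would be lost when indexing
    by_compare <- v != "Bottom"
    by_compare
    # [1]  TRUE FALSE    NA

    # %in% returns FALSE for NA here, so its negation is TRUE and the NA is kept
    by_match <- !(v %in% "Bottom")
    by_match
    # [1]  TRUE FALSE  TRUE

    v[by_match]
    # [1] "Surface" NA
    ```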
