NA when trying to summarize a subset of data (R)

Question

The whole vector is fine and has no NAs:

> summary(data$marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    6.00    6.02    7.00    7.00

> length(data$marks)
[1] 2528

However, when I summarize a subset selected by a criterion, I get lots of NAs:

> summary(data[data$student=="John",]$marks)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   6.000   6.000   6.169   7.000   7.000     464

> length(data[data$student=="John",]$marks)
[1] 523

Answer 1:

I think the problem is that student contains missing values. Comparing NA with "John" gives NA rather than FALSE, and indexing rows with NA returns a row of NAs, so every row where student is missing shows up as NA in marks when you take the subset. Wrap the subsetting condition in which() to avoid this: which() returns only the positions where the condition is TRUE. Here are a few examples that will hopefully clarify what's happening:

# Fake data
set.seed(103)
dat = data.frame(group=rep(LETTERS[1:3], each=3), 
                 value=rnorm(9))
dat$group[1] = NA

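dat$value                              # all nine values, none missing
dat[dat$group=="B", "value"]           # an NA for the row whose group is NA, then the B values
dat[which(dat$group=="B"), "value"]    # only the B values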

# Simpler example
x = c(10,20,30,40, NA)

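x>20              # FALSE FALSE TRUE TRUE NA
x[x>20]           # 30 40 NA: the NA in the condition is carried into the result

which(x>20)       # 3 4: only the positions where the condition is TRUE
x[which(x>20)]    # 30 40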
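
The same idea can be applied to data shaped like the question's. Below is a minimal sketch, assuming a small made-up data frame (the values are hypothetical, not the asker's data) in which student has missing entries and marks does not. Besides which(), it shows two other base R options that also treat an NA condition as "not selected": %in% and subset().

```r
# Hypothetical data mirroring the question: student has NAs, marks does not
data <- data.frame(
  student = c("John", NA, "Mary", "John", NA),
  marks   = c(6, 7, 5, 7, 1)
)

# Problem: an NA in the condition produces an all-NA row, so marks comes back NA
naive <- data[data$student == "John", ]$marks

# Three base R ways to keep only rows where the condition is actually TRUE
with_which  <- data[which(data$student == "John"), ]$marks
with_in     <- data[data$student %in% "John", ]$marks
with_subset <- subset(data, student == "John")$marks

print(naive)
print(with_which)
```

Which of the three to use is largely a matter of taste; which() stays closest to the original indexing code.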

Answer 2:

First, note that NA == "foo" results in NA. Second, when you subset a vector with an index that is NA, the result for that position is NA.

t = c(1,2,3)
t[c(1,NA)]    # 1 NA
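
The two facts combine in the logical-index case, which is what happens in the question. A short sketch, using made-up vectors (names and values here are illustrative only):

```r
# Hypothetical vectors: one name is missing, all values are present
name  <- c("foo", NA, "bar")
value <- c(1, 2, 3)

cond <- name == "foo"     # TRUE NA FALSE: comparing NA with a string gives NA
picked <- value[cond]     # 1 NA: the NA in the index becomes an NA in the result

print(cond)
print(picked)
```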

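Answer 3:

A tidyverse solution. I find these easier to read than base R. filter() keeps only the rows where the condition is TRUE, so rows where student is NA are dropped.

library(tidyverse)

# Keep John's rows, extract marks as a vector, then summarize it
data %>%
  filter(student == "John") %>%
  pull(marks) %>%
  summary()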

Source: https://stackoverflow.com/questions/34055552/na-when-trying-to-summarize-a-subset-of-data-r

Tags

dataframe

missing-data
