I've just started with R and I've executed these statements:
library(datasets)
head(airquality)
s <- split(airquality,airquality$Month)
sapply(s, function(x
sapply(s, function(x) {colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)}) treats each column individually and calculates the average of the non-NA values in each column.
lapply(s, function(x) {colMeans(na.omit(x[,c("Ozone", "Solar.R", "Wind")])) }) subsets each data frame in s to the rows where none of the three columns is NA, and then takes the column means of the resulting data.
The difference comes from the rows in which one or two of the three values are NA.
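To see where the two approaches can diverge in the airquality data, here is a minimal sketch that counts, per month, how many rows survive na.omit() versus how many non-NA values each column has on its own. The names cols and counts are just illustrative helpers, not part of the original code.

```r
library(datasets)

# Hypothetical helper names: cols, counts
cols <- c("Ozone", "Solar.R", "Wind")
s <- split(airquality, airquality$Month)

# For each month, report the total number of rows, the number of rows
# with no NA in any of the three columns (what na.omit() keeps), and
# the number of non-NA values in each column (what na.rm = TRUE uses).
counts <- sapply(s, function(x) {
  sub <- x[, cols]
  c(rows = nrow(sub),
    complete_rows = sum(complete.cases(sub)),
    colSums(!is.na(sub)))
})
counts
```

In any month where complete_rows is smaller than a column's own count, na.omit() discards valid values of that column, so the two means for that column can differ.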
They are not supposed to give the same result. Consider this example:
exdf <- data.frame(a = c(1, NA, 5), b = c(3, 2, 2))
#    a b
# 1  1 3
# 2 NA 2
# 3  5 2
colMeans(exdf, na.rm = TRUE)
#        a        b
# 3.000000 2.333333
colMeans(na.omit(exdf))
#   a   b
# 3.0 2.5
Why is this? In the first case, the mean of column b is calculated as (3+2+2)/3. In the second case, na.omit removes the second row in its entirety (including its value of b, which is not NA and is therefore counted in the first case), so the mean of b is just (3+2)/2.
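The same distinction can be written out explicitly. The following is a small sketch, with per_column and row_wise as illustrative names, that reproduces each result by hand: dropping NAs within each column separately, versus dropping every row that contains any NA.

```r
exdf <- data.frame(a = c(1, NA, 5), b = c(3, 2, 2))

# Per-column removal: each column drops only its own NA values,
# which is what colMeans(exdf, na.rm = TRUE) does.
per_column <- sapply(exdf, function(col) mean(na.omit(col)))

# Row-wise removal: a row is dropped if any of its columns is NA,
# which is what colMeans(na.omit(exdf)) does.
row_wise <- colMeans(exdf[complete.cases(exdf), ])

per_column
row_wise
```

If the goal is a mean per variable using all available values, the per-column form is probably what you want; the row-wise form fits better when later steps need the same set of rows for every variable.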