Why is the var() function giving me a different answer than my calculated variance?

Question

I wasn't sure if this should go in SO or some other .SE, so I will delete if this is deemed to be off-topic.

I have a vector and I'm trying to calculate the variance "by hand" (meaning based on the definition of variance, but still performing the calculations in R) using the equation V[X] = E[X^2] - (E[X])^2, where E[X] = sum(x * f(x)) and E[X^2] = sum(x^2 * f(x)).

However, my calculated variance is different from the result of the var() function that R has (which I was using to check my work). Why does var() give a different value? How is it calculating variance? I've checked my calculations several times, so I'm fairly confident in the value I calculated. My code is provided below.

vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
range(vec)
counts <- hist(vec + .01, breaks = 7)$counts
fx <- counts / (sum(counts)) #the pmf f(x)
x <- c(min(vec): max(vec)) #the values of x
exp <- sum(x * fx) ; exp #expected value of x
exp.square <- sum(x^2 * fx) #expected value of x^2
var <- exp.square - (exp)^2 ; var #calculated variance
var(vec)

This gives me a calculated variance of 2.234 but the var() function says the variance is 2.383.

Answer 1:

V[X] = E[X^2] - E[X]^2 is the population variance, which applies when the values in the vector are the whole population rather than a sample. The var() function instead calculates the sample variance, an estimator of the population variance that divides the sum of squared deviations from the mean by n - 1 rather than n.
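A minimal sketch of how the two numbers in the question relate, assuming the same vec as in the question. The names pop_var and sample_var are only illustrative and do not appear in the original code.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)

# Population-style variance: E[X^2] - E[X]^2, treating vec as the whole population
pop_var <- mean(vec^2) - mean(vec)^2

# var() divides the sum of squared deviations by n - 1 instead of n
sample_var <- var(vec)

pop_var                    # 2.234375
sample_var                 # 2.383333
sample_var * (n - 1) / n   # 2.234375, the same as pop_var up to floating-point error
```

The only difference between the two results is the factor (n - 1) / n, here 15/16.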

Answer 2:

While this has been answered already, I fear some may still be confused between population variance and its estimate from a sample, and this may be due to the example.

If the vector vec represents the full population, then vec is simply a way to represent the distribution function, which can be summarized more succinctly in the pmf that you derived from it. Crucially, the elements of vec in this case are not random variables. In this case, your computations of E[X] and var[X] from the pmf are correct.
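As a sketch of this full-population case, the pmf can also be built with table() and prop.table() instead of the hist() approach used in the question. The names fx, x, e_x, e_x2 and pop_var are illustrative choices; they also avoid exp and var, which mask R's built-in function names.

```r
vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)

# Relative frequency of each distinct value: the pmf if vec is the whole population
fx <- prop.table(table(vec))
x  <- as.numeric(names(fx))   # the distinct values of x, in the same order as fx

e_x     <- sum(x * fx)        # E[X]
e_x2    <- sum(x^2 * fx)      # E[X^2]
pop_var <- e_x2 - e_x^2       # V[X] = E[X^2] - E[X]^2

e_x       # 4.125
pop_var   # 2.234375
```

Values that never occur (here, 2) are simply absent from the table; they would contribute zero to both sums anyway.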

Most of the time, however, when you have data (for instance in the form of a vector), it is a random sample from the underlying population. Each element of the vector is the observed value of a random variable: it is a "draw" from the population. For this example, it is fair to assume that each element is drawn independently from the same distribution ("iid"). In practice, this random sampling means that you cannot compute the true pmf, because the observed frequencies vary by chance from one sample to the next. Likewise, you cannot get the true values of E[X], E[X^2], and thus Var[X], from the sample. These values need to be estimated. The sample average is usually a good estimate of E[X] (in particular, it is unbiased), but it turns out that the uncorrected sample variance, the average squared deviation from the sample average (with n in the denominator), is a biased estimate of the population variance: on average it is too small. To correct for this bias, you need to multiply it by the factor n/(n-1).
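The bias can be seen in a small simulation. This is only a sketch under assumed settings: the normal distribution, the seed, the sample size n = 5, the true variance sigma2 = 4 and the number of repetitions are all arbitrary choices made for illustration, not part of the original question.

```r
set.seed(1)       # arbitrary seed, only for reproducibility
n      <- 5       # small sample size, where the bias is easy to see
sigma2 <- 4       # true population variance (sd = 2)
reps   <- 20000   # number of simulated samples

# Each column of `samples` is one iid sample of size n
samples <- matrix(rnorm(n * reps, mean = 0, sd = sqrt(sigma2)), nrow = n)

# Variance estimate that divides by n (no correction)
uncorrected <- apply(samples, 2, function(s) mean((s - mean(s))^2))
# var() divides by n - 1
corrected <- apply(samples, 2, var)

mean(uncorrected)   # should be near sigma2 * (n - 1) / n = 3.2
mean(corrected)     # should be near sigma2 = 4
```

Averaged over many samples, the version that divides by n undershoots the true variance by roughly the factor (n - 1)/n, while the n/(n-1) correction removes that shortfall.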

As this latter case is the most common in practice (aside from textbook exercises), the bias-corrected estimate is what is computed when you call the var() function in R. So if you're asked to find the "estimated variance", it most likely implies that your vector vec is a random sample and that you fall in this latter case. If this was the original question, then you have your answer, and I hope it becomes clear that the choice of variable names and the comments in your code can lead to confusion. Indeed, you cannot compute the pmf, the expected value, or the variance of the population from a random sample. What you can get are their estimates, and one of them, the uncorrected variance estimate, is biased.

I wanted to clarify this because this confusion, as seen in the code, is very common when first getting acquainted with these concepts. In particular, the accepted answer may be misleading: V[X] = E[X^2] - E[X]^2 is not the sample variance; it is the population variance, which you cannot get from a random sample. If you replace the expectations in this equation by their sample estimates (averages), you get sample.V[X] = average[X^2] - average[X]^2, which is the uncorrected sample variance (denominator n), and it is biased.
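For reference, the relationship between the two sample quantities can be written out as follows. The notation (x_i for the observations, \bar{x} for the sample average, s^2 for the value returned by var()) is introduced only for this sketch.

```latex
\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
  = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2
  \qquad \text{(divides by } n \text{; this is average}[X^2]-\text{average}[X]^2\text{)}

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
    = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2\right)
```

For the vector in the question, n = 16, so 16/15 * 2.234375 = 2.383333 (rounded), which is the value var(vec) reports.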

Some may say that I am being picky about semantics; however, the "abuse of notation" in the accepted answer is only acceptable when everybody recognizes it as such. For those still trying to figure out these conceptual differences, I believe it is best to remain precise.

Answer 3:

Here's one way to calculate "estimated population variance" that matches the output of the var function in the stats package:

vec <- c(3, 5, 4, 3, 6, 7, 3, 6, 4, 6, 3, 4, 1, 3, 4, 4)
n <- length(vec)
average <- mean(vec)
differences <- vec - average
squared.differences <- differences^2
sum.of.squared.differences <-  sum(squared.differences)
estimator <- 1/(n - 1)
estimated.variance <- estimator * sum.of.squared.differences
estimated.variance
[1] 2.383333
all.equal(var(vec), estimated.variance) # The "hand calculated" variance matches var() from the stats package (compared with a floating-point tolerance rather than ==).
[1] TRUE

I wonder what folks think about labelling the 1/(n - 1) term "estimator."

In a function (that's unlikely to handle errors and anomalies as well as the var function in the stats package):

estimated.variance.by.hand <- function (x){
  n <- length(x)
  average <- mean(x)
  differences <- x - average
  squared.differences <- differences^2
  sum.of.squared.differences <-  sum(squared.differences)
  estimator <- 1/(n - 1)
  est.variance <- estimator * sum.of.squared.differences
  est.variance
}
estimated.variance.by.hand(vec)
estimated.variance.by.hand(1:10)
var(1:10)
estimated.variance.by.hand(1:100)
var(1:100)

Answer 4:

R's built-in var() uses N-1 in the denominator, which gives an unbiased estimator of the population variance. Unfortunately, there is no option to tell var() to use N instead, so I wrote my own variance function for that case.

var_N = function(x){var(x)*(length(x)-1)/length(x)}

And some code to compare the function above, the built-in function, the manual calculation, and @dca's estimated.variance.by.hand() function (defined in Answer 3):

## Data
x = c(4,5,6,7,8,2,4,6,6)
mean_x = mean(x)


## Variance with N-1 in denominator
var(x)
sum((x - mean_x) ^2) / (length(x) - 1)
estimated.variance.by.hand(x)


## Variance with N in denominator
sum((x - mean_x) ^2) / length(x)
var(x) * (length(x) - 1) / length(x)
var_N = function(x){var(x)*(length(x)-1)/length(x)}
var_N(x)

Source: https://stackoverflow.com/questions/28637908/why-is-the-var-function-giving-me-a-different-answer-than-my-calculated-varian

Tags

variance