Question
I have recently started using the statistics module for Python.
I've noticed that by default the variance() function returns the 'unbiased' variance, or sample variance:
import statistics as st
from random import randint

def myVariance(data):
    # population variance: mean squared deviation from the mean (divides by N)
    xbar = st.mean(data)
    return sum((x - xbar)**2 for x in data) / len(data)

def myUnbiasedVariance(data):
    # 'unbiased' (sample) variance of a given set of numbers (divides by N-1)
    xbar = st.mean(data)
    return sum((x - xbar)**2 for x in data) / (len(data) - 1)

population = [randint(0, 1000) for i in range(100)]
print(myVariance(population))
print(myUnbiasedVariance(population))
print(st.variance(population))
output:
81295.8011
82116.9708081
82116.9708081
This seems odd to me. I guess a lot of the time people are working with samples, so they want a sample variance, but I would expect the default function to calculate a population variance. Does anyone know why this is?
Answer 1:
I would argue that almost all the time when people estimate the variance from data they work with a sample. And, by the definition of unbiased estimate, the expected value of the unbiased estimate of the variance equals the population variance.
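To make the "expected value" statement concrete, here is a minimal sketch that checks it exactly on a toy case rather than by simulation. The population (the six faces of a fair die) and the sample size of 2 are assumptions chosen only so that every possible sample can be enumerated; exact fractions are used to avoid rounding.

```python
import itertools
import statistics as st
from fractions import Fraction

# Hypothetical toy population: the faces of a fair die, as exact fractions.
population = [Fraction(face) for face in range(1, 7)]
n = 2  # sample size (an arbitrary small choice)

# Every equally likely sample of size n drawn with replacement (6**2 = 36).
samples = list(itertools.product(population, repeat=n))

# Averaging over all samples gives the exact expected value of each estimator.
mean_sample_var = st.mean(st.variance(s) for s in samples)
mean_sample_pvar = st.mean(st.pvariance(s) for s in samples)
population_var = st.pvariance(population)

print("population variance:      ", population_var)
print("mean of variance(sample): ", mean_sample_var)
print("mean of pvariance(sample):", mean_sample_pvar)
```

The first two values are both 35/12, while the average of pvariance() over the samples is 35/24, i.e. too small by the factor (n-1)/n = 1/2.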
In your code, you use random.randint(0, 1000), which samples from a discrete uniform distribution with 1001 possible values and variance 1000*1002/12 = 83500 (see, e.g., MathWorld). Here is code that shows that, on average and when using samples as input, statistics.variance() gets closer to the population variance than statistics.pvariance():
import random
import statistics as st
import numpy as np

var, pvar = [], []
for i in range(10000):
    # draw a sample of 10 values from the population 0..1000
    smpl = [random.randint(0, 1000) for j in range(10)]
    var.append(st.variance(smpl))
    pvar.append(st.pvariance(smpl))

print("mean variance(sample): %.1f" % np.mean(var))
print("mean pvariance(sample): %.1f" % np.mean(pvar))
print("pvariance(population): %.1f" % st.pvariance(range(1001)))
Here is sample output:
mean variance(sample): 83626.0
mean pvariance(sample): 75263.4
pvariance(population): 83500.0
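The gap between the last two numbers is not noise: for a sample of size n, pvariance() divides by n and variance() divides by n-1, so the two differ by exactly the factor n/(n-1). The following sketch illustrates this on a single sample; the seed is an arbitrary assumption used only to make the run repeatable.

```python
import random
import statistics as st

random.seed(0)  # arbitrary seed, only so that the run is repeatable
n = 10
sample = [random.randint(0, 1000) for _ in range(n)]

biased = st.pvariance(sample)    # sum of squared deviations / n
unbiased = st.variance(sample)   # sum of squared deviations / (n - 1)

# Rescaling the biased value by n / (n - 1) recovers the unbiased one
# (up to floating-point rounding).
rescaled = biased * n / (n - 1)
print(unbiased, rescaled)

# Expected value of pvariance() for samples of size 10 from randint(0, 1000):
expected_pvar = 83500 * (n - 1) / n
print(expected_pvar)
```

The second print gives 75150.0, which is what the simulated mean of pvariance(sample) fluctuates around.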
Answer 2:
Here is another great post. I was wondering exactly the same thing, and the answer there really cleared it up for me. With np.var you can pass the argument ddof=1 to get the unbiased estimator. Check it out:
What is the difference between numpy var() and statistics variance() in python?
import numpy as np
print(np.var([1, 2, 3, 4], ddof=1))
1.66666666667
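For anyone unsure what ddof ("delta degrees of freedom") changes, here is a rough pure-Python sketch of the idea. The helper var_with_ddof is hypothetical, written only for illustration and not part of numpy or statistics: it divides the sum of squared deviations by n - ddof, so ddof=0 should match statistics.pvariance() and ddof=1 should match statistics.variance().

```python
import statistics as st

def var_with_ddof(data, ddof=0):
    # Hypothetical helper (not a library function): divide the sum of
    # squared deviations from the mean by n - ddof.
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - ddof)

data = [1, 2, 3, 4]
print(var_with_ddof(data, ddof=0), st.pvariance(data))  # divide by n
print(var_with_ddof(data, ddof=1), st.variance(data))   # divide by n - 1
```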
Source: https://stackoverflow.com/questions/39162505/why-does-statistics-variance-use-unbiased-sample-variance-by-default