If I have a list like this:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
I want to calculate the variance of this list in Python, i.e. the average of the squared differences from the mean.
How can I go about this? I'm confused about how to access the elements of the list to compute the squared differences.
You can use numpy's built-in function var:
import numpy as np
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(np.var(results))
This gives you 28.822364260579157
If, for whatever reason, you cannot use numpy and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a generator expression:
# calculate mean
m = sum(results) / len(results)
# calculate variance using a generator expression
var_res = sum((xi - m) ** 2 for xi in results) / len(results)
which gives you the identical result.
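Since the question was specifically about getting at the squared differences, it may help to see the intermediate values spelled out. The following is only a sketch on a small made-up list (not the list from the question); the names data, deviations and squared are chosen purely for illustration.

```python
# Small illustrative data set (an assumption, not the list from the question)
data = [1, 2, 3, 4, 5]

# Step 1: the mean
m = sum(data) / len(data)

# Step 2: the difference of every element from the mean
deviations = [x - m for x in data]

# Step 3: square each difference
squared = [d ** 2 for d in deviations]

# Step 4: the variance (n version) is the average of the squared differences
var_n = sum(squared) / len(squared)

print(deviations)  # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(squared)     # [4.0, 1.0, 0.0, 1.0, 4.0]
print(var_n)       # 2.0
```

The one-line generator expression above does steps 2 to 4 at once without storing the intermediate lists.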
If you are interested in the standard deviation, you can use numpy.std:
print(np.std(results))
5.36864640860051
@Serge Ballesta explained very well the difference between variance n and n-1. In numpy you can easily set this parameter using the option ddof; its default is 0, so for the n-1 case you can simply do:
np.var(results, ddof=1)
The "by hand" solution is given in @Serge Ballesta's answer.
Both approaches yield 32.024849178421285.
You can also set the parameter for std:
np.std(results, ddof=1)
5.659050201086865
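To see how the two settings relate, here is a rough sketch that avoids numpy and uses the standard library's statistics module instead (my substitution, so that it runs without third-party packages). It assumes pvariance plays the role of ddof=0 and variance the role of ddof=1.

```python
import math
from statistics import pvariance, variance, stdev

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
n = len(results)

var_n = pvariance(results)         # divides by n, like ddof=0
var_n_minus_1 = variance(results)  # divides by n-1, like ddof=1

# Rescaling the n version by n / (n - 1) gives the n-1 version
rescaled = var_n * n / (n - 1)
print(math.isclose(rescaled, var_n_minus_1))  # True

# The standard deviation is the square root of the matching variance
print(math.isclose(stdev(results), math.sqrt(var_n_minus_1)))  # True
```

With only 10 values the factor n / (n - 1) is about 1.11, which is why the two variances above differ noticeably; for large n the difference becomes negligible.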
Well, there are two ways of defining the variance. You have the variance n that you use when you have a full set, and the variance n-1 that you use when you have a sample.
The difference between the two is whether the value m = sum(xi) / n is the real average or whether it is just an approximation of what the average should be.
Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the real average, and the formulas given by Cleb are OK (variance n).
Example 2: you want to know the average time at which a bus passes the bus stop and its variance. You note the time for a month, and get 30 values. Here the value m = sum(xi) / n is only an approximation of the real average, and that approximation will be more accurate with more values. In that case the best approximation for the actual variance is the variance n-1:
varRes = sum([(xi - m)**2 for xi in results]) / (len(results) -1)
OK, this has nothing to do with Python, but it does have an impact on statistical analysis, and the question is tagged statistics and variance.
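The claim that the variance n-1 is the better estimate for a sample can be checked exactly on a toy case. The sketch below is only an illustration under stated assumptions: the "population" is the six faces of a die, samples are all 36 ordered pairs drawn with replacement, and variance_exact is a hypothetical helper that works in exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product


def variance_exact(values, ddof):
    """Hypothetical helper: variance with divisor (n - ddof), as an exact Fraction."""
    n = len(values)
    m = Fraction(sum(values), n)
    return sum((x - m) ** 2 for x in values) / (n - ddof)


population = [1, 2, 3, 4, 5, 6]
true_var = variance_exact(population, ddof=0)  # full set, so variance n

# Every possible sample of size 2, drawn with replacement
samples = list(product(population, repeat=2))

# Average the two estimators over all possible samples
avg_n = sum(variance_exact(s, ddof=0) for s in samples) / len(samples)
avg_n_minus_1 = sum(variance_exact(s, ddof=1) for s in samples) / len(samples)

print(true_var)       # 35/12
print(avg_n)          # 35/24
print(avg_n_minus_1)  # 35/12
```

On average the n version underestimates the true variance (here by a factor of 2 because the samples are so small), while the n-1 version hits it exactly.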
Note: libraries differ in their defaults. numpy uses the variance n by default for both var and std (ddof=0), while pandas, for example, defaults to the variance n-1 (ddof=1) for both, so check the documentation of the function you use.
Starting with Python 3.4, the standard library comes with the variance function (sample variance or variance n-1) as part of the statistics module:
from statistics import variance
# data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285
The population variance (or variance n) can be obtained using the pvariance function:
from statistics import pvariance
# data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157
Also note that if you already know the mean of your list, the variance and pvariance functions take a second argument (xbar and mu, respectively) in order to spare recomputing the mean of the sample (which is part of the variance computation).
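A short sketch of passing a precomputed mean, assuming the value handed over really is the mean of the data (xbar is expected to be exactly that); the variable names are only illustrative.

```python
import math
from statistics import mean, variance, pvariance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

# Compute the mean once and reuse it for both variances
m = mean(data)
sample_var = variance(data, m)       # second positional argument is xbar
population_var = pvariance(data, m)  # second positional argument is mu

# Same values as when the functions compute the mean themselves
print(math.isclose(sample_var, variance(data)))        # True
print(math.isclose(population_var, pvariance(data)))   # True
```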
Numpy is indeed the most elegant and fastest way to do it.
I think the actual question was about how to access the individual elements of a list to do such a calculation yourself, so below is an example:
import numpy as np

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print('numpy variance: ', np.var(results))
# without numpy, by hand
# there are two ways of calculating the variance
# - 1. directly, as the 2nd-order central moment (https://en.wikipedia.org/wiki/Moment_(mathematics)): the sum of squared deviations divided by the length of the vector
# - 2. "mean of square minus square of mean" (see https://en.wikipedia.org/wiki/Variance)
# calculate mean
n = len(results)
total = 0  # named "total" so that the built-in sum() is not shadowed
for i in range(n):
    total = total + results[i]
mean = total / n
print('mean: ', mean)
# calculate the central moment
sum2 = 0
for i in range(n):
    sum2 = sum2 + (results[i] - mean)**2
myvar1 = sum2 / n
print("my variance1: ", myvar1)
# calculate the mean of square minus square of mean
sum3 = 0
for i in range(n):
    sum3 = sum3 + results[i]**2
myvar2 = sum3 / n - mean**2
print("my variance2: ", myvar2)
gives you (values shown to 12 significant digits):
numpy variance: 28.8223642606
mean: -3.731599805
my variance1: 28.8223642606
my variance2: 28.8223642606
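The two formulas agree on this data, but they do not behave the same in floating point in general. The following sketch uses made-up values with a large offset and a small spread (my assumption, chosen to provoke the problem) to show where "mean of square minus square of mean" can break down.

```python
# Made-up data: a large offset plus small variations
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(data)
mean = sum(data) / n

# 1. central moment: subtract the mean first, then square
var_central = sum((x - mean) ** 2 for x in data) / n

# 2. mean of squares minus square of mean:
#    subtracts two huge, nearly equal numbers
var_shortcut = sum(x ** 2 for x in data) / n - mean ** 2

print(var_central)   # 22.5
print(var_shortcut)  # not 22.5: the digits that matter were rounded away
```

The squares are around 1e18, where neighbouring floats are more than 100 apart, so the small differences that make up the variance are lost before the subtraction happens. For data like this, the first formula is the safer choice.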
The correct answer is to use one of the packages like NumPy, but if you want to roll your own and compute the variance incrementally, there is a good algorithm that has higher accuracy. See https://www.johndcook.com/blog/standard_deviation/
I ported my Perl implementation to Python. Please point out issues in the comments.
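Mklast = 0  # running mean after the previous value
Mk = 0      # running mean after the current value
Sk = 0      # running sum of squared deviations
k = 0       # number of values seen so far
for xi in results:
    k = k + 1
    Mk = Mklast + (xi - Mklast) / k
    Sk = Sk + (xi - Mklast) * (xi - Mk)
    Mklast = Mk
var = Sk / (k - 1)  # sample variance (variance n-1)
print(var)
The answer is approximately 32.0248491784.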
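If the values arrive one at a time, the same update rule can be wrapped in a small object. This is only a sketch: RunningVariance, push and the ddof argument are hypothetical names of mine, not part of any library, and the comparison against statistics.variance is just a sanity check.

```python
import math
from statistics import variance


class RunningVariance:
    """Hypothetical helper: keeps a running mean and variance, one value at a time."""

    def __init__(self):
        self.k = 0       # number of values seen so far
        self.mean = 0.0  # running mean
        self.s = 0.0     # running sum of squared deviations

    def push(self, x):
        self.k += 1
        old_mean = self.mean
        self.mean = old_mean + (x - old_mean) / self.k
        self.s += (x - old_mean) * (x - self.mean)

    def variance(self, ddof=1):
        # ddof=1 gives the variance n-1, ddof=0 the variance n
        if self.k - ddof <= 0:
            raise ValueError("not enough data points")
        return self.s / (self.k - ddof)


results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

rv = RunningVariance()
for x in results:
    rv.push(x)

print(math.isclose(rv.variance(), variance(results)))  # True
```

Unlike the loop above, this version raises an error instead of dividing by zero when it has seen too few values.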
import numpy as np

def get_variance(xs):
    mean = np.mean(xs)
    summed = 0
    for x in xs:
        summed += (x - mean)**2
    return summed / (len(xs) - 1)

print(get_variance([1, 2, 3, 4, 5]))
# 2.5
# the same sample variance using numpy's built-in function
a = [1, 2, 3, 4, 5]
variance = np.var(a, ddof=1)
print(variance)
Source: https://stackoverflow.com/questions/35583302/how-can-i-calculate-the-variance-of-a-list-in-python