If I have a list like this:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.032
You can use numpy's built-in function var:
import numpy as np
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(np.var(results))
This gives you 28.822364260579157
If - for whatever reason - you cannot use numpy
and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a list comprehension:
# calculate mean
m = sum(results) / len(results)
# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)
which gives you the identical result.
If you are interested in the standard deviation, you can use numpy.std:
print(np.std(results))
5.36864640860051
@Serge Ballesta explained very well the difference between variance n
and n-1
. In numpy you can easily set this parameter using the option ddof
; its default is 0
, so for the n-1
case you can simply do:
np.var(results, ddof=1)
The "by hand" solution is given in @Serge Ballesta's answer.
Both approaches yield 32.024849178421285
.
You can set the parameter also for std
:
np.std(results, ddof=1)
5.659050201086865
Well, there are two ways for defining the variance. You have the variance n that you use when you have a full set, and the variance n-1 that you use when you have a sample.
The difference between the 2 is whether the value m = sum(xi) / n
is the real average or whether it is just an approximation of what the average should be.
Example1 : you want to know the average height of the students in a class and its variance : ok, the value m = sum(xi) / n
is the real average, and the formulas given by Cleb are ok (variance n).
Example2 : you want to know the average hour at which a bus passes at the bus stop and its variance. You note the hour for a month, and get 30 values. Here the value m = sum(xi) / n
is only an approximation of the real average, and that approximation will be more accurate with more values. In that case the best approximation for the actual variance is the variance n-1
varRes = sum([(xi - m)**2 for xi in results]) / (len(results) -1)
Ok, it has nothing to do with Python, but it does have an impact on statistical analysis, and the question is tagged statistics and variance
Note: ordinarily, statistical libraries like numpy use the variance n for what they call var
or variance
, and the variance n-1 for the function that gives the standard deviation.
Starting Python 3.4
, the standard library comes with the variance function (sample variance or variance n-1) as part of the statistics module:
from statistics import variance
# data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285
The population variance (or variance n) can be obtained using the pvariance function:
from statistics import pvariance
# data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157
Also note that if you already know the mean of your list, the variance
and pvariance
functions take a second argument (respectively xbar
and mu
) in order to spare recomputing the mean of the sample (which is part of the variance computation).
Numpy is indeed the most elegant and fast way to do it.
I think the actual question was about how to access the individual elements of a list to do such a calculation yourself, so below an example:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
import numpy as np
print 'numpy variance: ', np.var(results)
# without numpy by hand
# there are two ways of calculating the variance
# - 1. direct as central 2nd order moment (https://en.wikipedia.org/wiki/Moment_(mathematics))divided by the length of the vector
# - 2. "mean of square minus square of mean" (see https://en.wikipedia.org/wiki/Variance)
# calculate mean
n= len(results)
sum=0
for i in range(n):
sum = sum+ results[i]
mean=sum/n
print 'mean: ', mean
# calculate the central moment
sum2=0
for i in range(n):
sum2=sum2+ (results[i]-mean)**2
myvar1=sum2/n
print "my variance1: ", myvar1
# calculate the mean of square minus square of mean
sum3=0
for i in range(n):
sum3=sum3+ results[i]**2
myvar2 = sum3/n - mean**2
print "my variance2: ", myvar2
gives you:
numpy variance: 28.8223642606
mean: -3.731599805
my variance1: 28.8223642606
my variance2: 28.8223642606
Without imports, I would use the following python3 script:
#!/usr/bin/env python3
def createData():
data1=[12,54,60,3,15,6,36]
data2=[1,2,3,4,5]
data3=[100,30000,1567,3467,20000,23457,400,1,15]
dataset=[]
dataset.append(data1)
dataset.append(data2)
dataset.append(data3)
return dataset
def calculateMean(data):
means=[]
# one list of the nested list
for oneDataset in data:
sum=0
mean=0
# one datapoint in one inner list
for number in oneDataset:
# summing up
sum+=number
# mean for one inner list
mean=sum/len(oneDataset)
# adding a tuples of the original data and their mean to
# a list of tuples
item=(oneDataset, mean)
means.append(item)
return means
# to do: substract mean from each element and square the result
# sum up the square results and divide by number of elements
def calculateVariance(meanData):
variances=[]
# meanData is the list of tuples
# pair is one tuple
for pair in meanData:
# pair[0] is the original data
interResult=0
squareSum=0
for element in pair[0]:
interResult=(element-pair[1])**2
squareSum+=interResult
variance=squareSum/len(pair[0])
variances.append((pair[0], pair[1], variance))
return variances
def main():
my_data=createData()
my_means=calculateMean(my_data)
my_variances=calculateVariance(my_means)
print(my_variances)
if __name__ == "__main__":
main()
here you get a print of the original data, their mean and the variance. I know this approach covers a list of several datasets, yet I think you can adapt it quickly for your purpose ;)
import numpy as np
def get_variance(xs):
mean = np.mean(xs)
summed = 0
for x in xs:
summed += (x - mean)**2
return summed / (len(xs))
print(get_variance([1,2,3,4,5]))
out 2.0
a = [1,2,3,4,5]
variance = np.var(a, ddof=1)
print(variance)