How can I calculate the variance of a list in python?

醉酒成梦 2020-12-29 04:58

If I have a list like this:

results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
          0.53459687, -1.34069996, -1.61042692, -4.032         


        
7 answers
  • 2020-12-29 05:02

    You can use numpy's built-in function var:

    import numpy as np
    
    results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
              0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
    
    print(np.var(results))
    

    This gives you 28.822364260579157

    If, for whatever reason, you cannot use numpy or would rather not rely on a built-in function, you can also calculate it "by hand", e.g. with a generator expression:

    # calculate mean
    m = sum(results) / len(results)
    
    # calculate variance (population variance, divisor n) using a generator expression
    var_res = sum((xi - m) ** 2 for xi in results) / len(results)
    

    which gives you the identical result.

    If you are interested in the standard deviation, you can use numpy.std:

    print(np.std(results))
    # 5.36864640860051
    

    @Serge Ballesta's answer explains the difference between the variance n and the variance n-1 very well. In numpy you choose between them with the option ddof ("delta degrees of freedom"): the divisor is n - ddof and the default is ddof=0, so for the n-1 case you can simply do:

    np.var(results, ddof=1)
    

    The "by hand" solution is given in @Serge Ballesta's answer.

    Both approaches yield 32.024849178421285.
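For readers without numpy, the effect of ddof can be sketched in plain Python. The helper name `variance_ddof` is hypothetical (it is not a numpy or standard-library function); it simply divides the sum of squared deviations by n - ddof, which is an assumption about how you would want to mirror numpy's option by hand.

```python
def variance_ddof(values, ddof=0):
    """Variance with divisor n - ddof (hypothetical helper, not a library API)."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more data points than ddof")
    mean = sum(values) / n
    # Sum of squared deviations from the mean, divided by n - ddof
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

population_var = variance_ddof(results)       # divisor n
sample_var = variance_ddof(results, ddof=1)   # divisor n - 1
print(population_var)
print(sample_var)
```

Up to floating-point rounding, the two printed values should match the numpy results quoted in this answer.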

    You can also set this parameter for std:

    np.std(results, ddof=1)
    # 5.659050201086865
    
  • 2020-12-29 05:10

    Well, there are two ways of defining the variance. You have the variance n, which you use when you have the full set (the whole population), and the variance n-1, which you use when you only have a sample.

    The difference between the two is whether the value m = sum(xi) / n is the real average or just an approximation of what the average should be.

    Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the real average, and the formulas given by Cleb are fine (variance n).

    Example 2: you want to know the average hour at which a bus passes the bus stop and its variance. You note the hour for a month and get 30 values. Here the value m = sum(xi) / n is only an approximation of the real average, and that approximation becomes more accurate with more values. In that case the best approximation of the actual variance is the variance n-1:

    varRes = sum([(xi - m)**2 for xi in results]) / (len(results) -1)
    

    OK, this has nothing to do with Python as such, but it does affect the statistical analysis, and the question is tagged statistics and variance.

    Note: check each library's default before comparing results. In numpy, both var and std use the variance n by default (ddof=0), and you pass ddof=1 to get the n-1 version; in Python's statistics module, variance and stdev use n-1, while pvariance and pstdev use n.
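The sample case (Example 2) can be checked with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with standard deviation 2 (true variance 4.0), the sample size is 5, and the helper `variance` and all parameter values are chosen for illustration.

```python
import random


def variance(values, ddof):
    """Variance with divisor n - ddof (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_variance = 4.0      # assumed population: normal, standard deviation 2
sample_size = 5
trials = 20000

sum_n = 0.0
sum_n_minus_1 = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(sample_size)]
    sum_n += variance(sample, ddof=0)          # variance n
    sum_n_minus_1 += variance(sample, ddof=1)  # variance n-1

avg_n = sum_n / trials
avg_n_minus_1 = sum_n_minus_1 / trials
print("average variance n:  ", avg_n)
print("average variance n-1:", avg_n_minus_1)
```

On average the variance n comes out near (n-1)/n times the true variance (here 0.8 * 4.0 = 3.2), while the variance n-1 averages close to 4.0, which is why it is preferred when the mean itself is estimated from the sample.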

  • 2020-12-29 05:11

    Starting with Python 3.4, the standard library provides a variance function (sample variance, or variance n-1) as part of the statistics module:

    from statistics import variance
    data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
            0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
    variance(data)
    # 32.024849178421285
    

    The population variance (or variance n) can be obtained using the pvariance function:

    from statistics import pvariance
    data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
            0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
    pvariance(data)
    # 28.822364260579157
    

    Also note that if you already know the mean of your list, the variance and pvariance functions take a second argument (xbar and mu respectively), so the mean, which is needed for the variance computation, does not have to be recomputed.
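A minimal sketch of passing a precomputed mean, assuming the mean comes from statistics.mean on the same data. The functions do not check that the value you pass really is the mean, so passing anything else can give a meaningless result.

```python
from statistics import mean, pvariance, variance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

m = mean(data)  # computed once, reused for both calls below

sample_var = variance(data, xbar=m)      # variance n-1
population_var = pvariance(data, mu=m)   # variance n
print(sample_var)
print(population_var)
```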

  • 2020-12-29 05:12

    Numpy is indeed the most elegant and fast way to do it.

    I think the actual question was about how to access the individual elements of a list to do such a calculation yourself, so here is an example:

    results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
          0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
    
    import numpy as np
    print('numpy variance: ', np.var(results))
    
    
    # without numpy, by hand
    
    # there are two ways of calculating the variance
    #   - 1. directly, as the 2nd-order central moment (https://en.wikipedia.org/wiki/Moment_(mathematics)): the sum of squared deviations from the mean divided by the length of the vector
    #   - 2. "mean of square minus square of mean" (see https://en.wikipedia.org/wiki/Variance)
    
    # calculate mean
    n = len(results)
    total = 0  # named total so the built-in sum() is not shadowed
    for i in range(n):
        total = total + results[i]
    
    
    mean = total / n
    print('mean: ', mean)
    
    # calculate the central moment
    sum2 = 0
    for i in range(n):
        sum2 = sum2 + (results[i] - mean)**2
    
    myvar1 = sum2 / n
    print("my variance1: ", myvar1)
    
    # calculate the mean of square minus square of mean
    sum3 = 0
    for i in range(n):
        sum3 = sum3 + results[i]**2
    
    myvar2 = sum3 / n - mean**2
    print("my variance2: ", myvar2)
    

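    This gives you (output as printed by Python 2, which shows floats to 12 significant digits; Python 3 prints more digits):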

    numpy variance:  28.8223642606
    mean:  -3.731599805
    my variance1:  28.8223642606
    my variance2:  28.8223642606
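The two formulas agree here, but they can behave differently in floating point. The sketch below is an illustration with made-up data (the values 4, 7, 13, 16 and the offset 1e9 are assumptions chosen for the demonstration, and both function names are hypothetical): when the values are large compared with their spread, "mean of square minus square of mean" subtracts two nearly equal huge numbers and loses the answer.

```python
def variance_two_pass(values):
    """Central moment: mean first, then squared deviations (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_one_pass(values):
    """Mean of square minus square of mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean * mean


small = [4.0, 7.0, 13.0, 16.0]
shifted = [1e9 + x for x in small]  # same spread, large offset

print(variance_two_pass(small), variance_one_pass(small))
print(variance_two_pass(shifted), variance_one_pass(shifted))
```

Shifting every value by a constant does not change the true variance (22.5 in both cases), so any difference in the second line is rounding error from the second formula.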
    
  • 2020-12-29 05:12

    Without imports, I would use the following python3 script:

    #!/usr/bin/env python3
    
    def createData():
        data1=[12,54,60,3,15,6,36]
        data2=[1,2,3,4,5]
        data3=[100,30000,1567,3467,20000,23457,400,1,15]
    
        dataset=[]
        dataset.append(data1)
        dataset.append(data2)
        dataset.append(data3)
    
        return dataset
    
    def calculateMean(data):
        means=[]
        # one list of the nested list
        for oneDataset in data:
            # named total so the built-in sum() is not shadowed
            total=0
            # one datapoint in one inner list
            for number in oneDataset:
                # summing up
                total+=number
            # mean for one inner list
            mean=total/len(oneDataset)
            # adding a tuple of the original data and its mean to
            # a list of tuples
            item=(oneDataset, mean)
            means.append(item)
    
        return means
    
    # subtract the mean from each element and square the result,
    # then sum up the squared results and divide by the number of elements
    def calculateVariance(meanData):
        variances=[]
        # meanData is the list of tuples
        # pair is one tuple
        for pair in meanData:
            # pair[0] is the original data
            interResult=0
            squareSum=0
            for element in pair[0]:
                interResult=(element-pair[1])**2
                squareSum+=interResult
            variance=squareSum/len(pair[0])
            variances.append((pair[0], pair[1], variance))
    
        return variances
    
    
    
    
    
    def main():
        my_data=createData()
        my_means=calculateMean(my_data)
        my_variances=calculateVariance(my_means)
        print(my_variances)
    
    if __name__ == "__main__":
        main()
    

    This prints the original data, their mean, and their variance for each dataset. I know this approach handles a list of several datasets, but I think you can adapt it quickly for your purpose ;)
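One possible adaptation to a single list, still without imports, might look like the sketch below. The function name `mean_and_variance` is hypothetical, and it keeps the same population variance (divisor n) as the script above.

```python
def mean_and_variance(data):
    """Return (mean, population variance) of one list, without imports."""
    total = 0
    for number in data:
        total += number
    mean = total / len(data)

    square_sum = 0
    for number in data:
        # squared deviation of each element from the mean
        square_sum += (number - mean) ** 2

    return mean, square_sum / len(data)


mean, variance = mean_and_variance([1, 2, 3, 4, 5])
print(mean, variance)  # 3.0 2.0
```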

  • 2020-12-29 05:21
    import numpy as np
    def get_variance(xs):
        mean = np.mean(xs)
        summed = 0
        for x in xs:
            summed += (x - mean)**2
        return summed / (len(xs))
    print(get_variance([1,2,3,4,5]))
    

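    Output: 2.0 (population variance, divisor n). For the sample variance (divisor n-1), pass ddof=1 to np.var, which gives 2.5 for the same list: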

    a = [1,2,3,4,5]
    variance = np.var(a, ddof=1)
    print(variance)
    