Python: Numpy standard deviation error

轻奢々 2020-12-29 23:13

This is a simple test

import numpy as np
data = np.array([-1,0,1])
print(data.std())

>> 0.816496580928

I don't understand how this result is calculated. I expected 1 (dividing the sum of squared deviations by N - 1, i.e. the sample standard deviation).

3 Answers
  • 2020-12-29 23:45

The crux of the problem is that NumPy divides by N (3), not N - 1 (2). As Iarsmans pointed out, NumPy uses the population variance, not the sample variance.

    So the real answer is sqrt(2/3) which is exactly that: 0.8164965...

    If you deliberately want a different value for the degrees of freedom (the default is 0), pass the keyword argument ddof with a positive value:

    np.std(data, ddof=1)
    

    ... but doing so here would reintroduce your original problem as numpy will divide by N - ddof.
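To make the two divisors concrete, here is a short sketch on the same data, showing that ddof=0 (the default) divides by N while ddof=1 divides by N - 1:

```python
import numpy as np

data = np.array([-1, 0, 1])

# Population standard deviation: divisor is N = 3 (numpy's default, ddof=0)
pop_std = data.std()           # sqrt(2/3) ≈ 0.8165

# Sample standard deviation: divisor is N - 1 = 2 (ddof=1)
sample_std = data.std(ddof=1)  # sqrt(2/2) = 1.0

print(pop_std, sample_std)
```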

  • 2020-12-29 23:55

    When getting into NumPy from Matlab, you'll probably want to keep the docs for both handy. They're similar but often differ in small but important details. Basically, they calculate the standard deviation differently. I would strongly recommend checking the documentation for anything you use that calculates standard deviation, whether a pocket calculator or a programming language, since the default is not (sorry!) standardized.

    Numpy STD: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html

    Matlab STD: http://www.mathworks.com/help/matlab/ref/std.html

    The NumPy docs for std are a bit opaque, IMHO, especially considering that NumPy docs are generally fairly clear. If you read far enough: "The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population." (In English: the default is the population standard deviation; set ddof=1 for the sample standard deviation.)

    OTOH, the Matlab docs make clear the difference that's tripping you up:

    There are two common textbook definitions for the standard deviation s of a data vector X. [equations omitted] n is the number of elements in the sample. The two forms of the equation differ only in n – 1 versus n in the divisor.

    So, by default, Matlab calculates the sample standard deviation (N - 1 in the divisor, so the result is bigger, to compensate for the fact that this is a sample), and NumPy calculates the population standard deviation (N in the divisor). You use the ddof parameter to switch to the sample standard deviation, or any other denominator you want (which goes beyond my statistics knowledge).
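As a sanity check, the two textbook formulas can be computed by hand and compared with NumPy's results (a sketch; the Matlab side is just the N - 1 formula, not an actual Matlab call):

```python
import math
import numpy as np

x = [-1, 0, 1]
n = len(x)
mean = sum(x) / n
ss = sum((v - mean) ** 2 for v in x)   # sum of squared deviations = 2

pop = math.sqrt(ss / n)         # N in the divisor     -> NumPy's default
samp = math.sqrt(ss / (n - 1))  # N - 1 in the divisor -> Matlab's default

# Both hand computations agree with numpy.std for the matching ddof
assert abs(pop - np.std(x)) < 1e-12
assert abs(samp - np.std(x, ddof=1)) < 1e-12
```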

    Lastly, it doesn't help on this problem, but you'll probably find this helpful at some point. http://wiki.scipy.org/NumPy_for_Matlab_Users

  • 2020-12-29 23:58

    It is worth reading the help page for the function/method before suggesting it is incorrect. The method does exactly what the docstring says it should: it divides by 3, because by default ddof is zero:

    In [3]: numpy.std?
    
    String form: <function std at 0x104222398>
    File:        /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.py
    Definition:  numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)
    Docstring:
    Compute the standard deviation along the specified axis.
    
    ...
    
    ddof : int, optional
        Means Delta Degrees of Freedom.  The divisor used in calculations
        is ``N - ddof``, where ``N`` represents the number of elements.
        By default `ddof` is zero.
    