Calculating the mean and standard deviation of data that does not fit in memory using Python [duplicate]

Submitted by 爷,独闯天下 on 2021-02-07 03:38:15

Question


I have a lot of data stored on disk in large arrays. I can't load all of it into memory at once.

How can I calculate the mean and the standard deviation?


Answer 1:


There is a simple online algorithm that computes both the mean and the variance by looking at each datapoint once and using O(1) memory.

Wikipedia offers the following code:

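def online_variance(data):
    # Welford's single-pass algorithm: O(1) memory, one look at each value.
    n = 0        # number of values seen so far
    mean = 0.0   # running mean
    M2 = 0.0     # running sum of squared deviations from the mean

    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta/n
        M2 = M2 + delta*(x - mean)

    # The sample variance is undefined for fewer than two values.
    if n < 2:
        return float('nan')

    variance = M2/(n - 1)
    return variance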

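This algorithm is also known as Welford's method. Unlike the method suggested in the other answer, it is numerically stable: it never subtracts two large, nearly equal sums, so it avoids the catastrophic cancellation that can occur when the mean is large relative to the spread of the data.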
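To see the difference in numerical behavior, here is a small sketch comparing Welford's update with the sum-of-squares formula on data shifted by a large offset. The function names and the test data are my own choices for illustration. The exact value the naive version returns depends on floating-point rounding, so it is printed rather than stated here; with IEEE doubles it will generally be far from the true sample variance of 30.

```python
def naive_variance(data):
    """Sample variance from the sum and the sum of squares (cancellation-prone)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    # Subtracts two huge, nearly equal numbers when the mean is large.
    return (total_sq - total * total / n) / (n - 1)


def welford_variance(data):
    """Sample variance using Welford's single-pass update."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


# Same spread as [4, 7, 13, 16] (sample variance 30), shifted by 1e9.
shifted = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
naive_result = naive_variance(shifted)
welford_result = welford_variance(shifted)
print("naive:  ", naive_result)
print("Welford:", welford_result)
```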

Take the square root of the variance to get the standard deviation.
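Since the question asks for both statistics and the data arrives in pieces, here is a minimal sketch that applies the same update across chunks and returns the mean together with the sample standard deviation. `streaming_mean_std` and `make_chunks` are hypothetical names, and the generator only stands in for whatever actually loads one array at a time from disk.

```python
import math


def streaming_mean_std(chunks):
    """Return (mean, sample standard deviation) over an iterable of chunks.

    Only one chunk has to be in memory at any time.
    """
    n = 0
    mean = 0.0
    m2 = 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two data points")
    return mean, math.sqrt(m2 / (n - 1))


def make_chunks():
    """Stand-in for loading one array at a time from disk (assumption)."""
    yield [2.0, 4.0, 4.0]
    yield [4.0, 5.0]
    yield [5.0, 7.0, 9.0]


mean, std = streaming_mean_std(make_chunks())
print(mean, std)
```

The inner loop is plain Python, so for very large arrays a vectorized per-chunk merge would likely be faster; this version only aims to show the bookkeeping.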




Answer 2:


Sounds like a math question. For the mean, you know that you can take the mean of a chunk of data, and then take the mean of the means. If the chunks aren't the same size, you'll have to take a weighted average.
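A small sketch of the weighted average, assuming each chunk is a sequence that fits in memory on its own. `combined_mean` is a hypothetical helper name, and the two chunks are made-up data chosen so that the unweighted mean of means is visibly wrong.

```python
def combined_mean(chunks):
    """Overall mean from per-chunk means, weighted by chunk size."""
    total_count = 0
    weighted_sum = 0.0
    for chunk in chunks:
        if len(chunk) == 0:
            continue  # an empty chunk contributes nothing
        chunk_mean = sum(chunk) / len(chunk)
        weighted_sum += chunk_mean * len(chunk)
        total_count += len(chunk)
    if total_count == 0:
        raise ValueError("no data")
    return weighted_sum / total_count


chunks = [[1.0, 2.0, 3.0], [10.0]]
overall = combined_mean(chunks)      # weighted: (2.0 * 3 + 10.0 * 1) / 4
unweighted = (2.0 + 10.0) / 2        # plain mean of the two chunk means
print(overall, unweighted)
```

The weighted result matches the mean of all four values, while the plain mean of means overweights the single-element chunk.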

For the standard deviation, you'll have to calculate the variance first. I'd suggest doing this alongside the calculation of the mean. For variance, you have

Var(X) = Avg(X^2) - Avg(X)^2

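So compute the average of your data and the average of your (data^2). Aggregate them as above, and then take the difference. Note that this gives the population variance (dividing by n); multiply it by n/(n - 1) if you want the sample variance.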

Then the standard deviation is just the square root of the variance.
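Here is a rough sketch of that recipe, assuming the data can be read as a sequence of in-memory chunks. `chunked_mean_std` is a hypothetical name. It returns the population standard deviation, and the clamp to zero is my own addition to guard against rounding pushing the difference slightly negative.

```python
import math


def chunked_mean_std(chunks):
    """Return (mean, population std) using Var(X) = Avg(X^2) - Avg(X)^2."""
    n = 0
    sum_x = 0.0    # size-weighted sum of per-chunk averages of x
    sum_x2 = 0.0   # size-weighted sum of per-chunk averages of x**2
    for chunk in chunks:
        k = len(chunk)
        if k == 0:
            continue
        avg_x = sum(chunk) / k
        avg_x2 = sum(x * x for x in chunk) / k
        sum_x += avg_x * k
        sum_x2 += avg_x2 * k
        n += k
    if n == 0:
        raise ValueError("no data")
    mean = sum_x / n
    # Rounding can make the difference slightly negative, so clamp at 0.
    variance = max(sum_x2 / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, std = chunked_mean_std(chunks)
print(mean, std)
```

When the mean is very large compared with the spread of the data, this formula can lose most of its precision, so it is worth checking the result against a more stable method in that case.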

Note that you could do the whole thing with iterators, which is probably the most memory-efficient approach, since only one value needs to be held at a time.
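As a sketch of the iterator idea, the generator below yields one number at a time and the accumulator consumes it lazily, so nothing is ever materialized as a full list. The one-number-per-line text format and the names `read_values` and `mean_std_from_iter` are assumptions made for illustration; `io.StringIO` stands in for a real file opened from disk.

```python
import io
import math


def read_values(lines):
    """Lazily yield one float per non-empty line."""
    for line in lines:
        text = line.strip()
        if text:
            yield float(text)


def mean_std_from_iter(values):
    """Single pass over an iterator; returns (mean, population std)."""
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    if n == 0:
        raise ValueError("no data")
    mean = sum_x / n
    # Var(X) = Avg(X^2) - Avg(X)^2, clamped in case rounding goes below 0.
    variance = max(sum_x2 / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# A real file object from open(...) could be passed to read_values as well.
fake_file = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
mean, std = mean_std_from_iter(read_values(fake_file))
print(mean, std)
```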



Source: https://stackoverflow.com/questions/15638612/calculating-mean-and-standard-deviation-of-the-data-which-does-not-fit-in-memory
