Memory consumption of NumPy function for standard deviation

误落风尘 2021-01-11 11:39

I'm currently using the Python bindings of GDAL to work on quite large raster data sets (> 4 GB). Since loading them into memory at once is not a feasible solution for me, I re
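
For reference, the kind of block-wise access I mean with the GDAL bindings looks roughly like this (the file name, strip size and single-band assumption are placeholders, not my real setup):

    import numpy as np
    from osgeo import gdal

    def iter_blocks(path, rows_per_chunk=1024):
        """Yield band 1 of a raster as float64 strips so it is never fully in RAM."""
        ds = gdal.Open(path)
        band = ds.GetRasterBand(1)
        for yoff in range(0, band.YSize, rows_per_chunk):
            rows = min(rows_per_chunk, band.YSize - yoff)
            yield band.ReadAsArray(0, yoff, band.XSize, rows).astype(np.float64)

    for block in iter_blocks("huge_raster.tif"):  # placeholder path
        ...  # process one strip at a time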

2 Answers
  • 2021-01-11 11:55

    I doubt you will find any such functions in numpy. The raison d'être of numpy is that it takes advantage of vector processor instruction sets, performing the same instruction on large amounts of data. Basically numpy trades memory efficiency for speed efficiency. However, because plain Python objects are memory intensive, numpy also achieves real memory savings by associating the data type with the array as a whole rather than with each individual element.
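
    To make that last point concrete, a list of one million Python floats carries per-object overhead for every element, whereas the equivalent float64 array stores a single dtype header plus one packed buffer (a rough comparison, exact numbers vary by interpreter and platform):

    import sys
    import numpy as np

    values = [float(i) for i in range(1_000_000)]
    arr = np.arange(1_000_000, dtype=np.float64)

    # list: container pointers plus a full Python float object per element
    list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
    # array: one header plus a contiguous 8-byte-per-element data buffer
    print(list_bytes, arr.nbytes)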

    One way to reduce the memory overhead, while sacrificing only a little speed, is to calculate the standard deviation in chunks, e.g.:

    import numpy as np
    
    def std(arr, blocksize=1000000):
        """Written for py3, change range to xrange for py2.
        This implementation requires the entire array in memory, but it shows how you can
        calculate the standard deviation in a piecemeal way.
        """
        num_blocks, remainder = divmod(len(arr), blocksize)
        mean = arr.mean()
        tmp = np.empty(blocksize, dtype=float)
        total_squares = 0
        for start in range(0, blocksize*num_blocks, blocksize):
            # get a view of the data we want -- views do not "own" the data they point to
            # -- they have minimal memory overhead
            view = arr[start:start+blocksize]
            # in-place operations prevent a new array from being created
            np.subtract(view, mean, out=tmp)
            tmp *= tmp
            total_squares += tmp.sum()
        if remainder:
            # len(arr) % blocksize != 0, so process the last partial block separately
            # create copy of view, with the smallest amount of new memory allocation possible
            # -- one more array *view*
            view = arr[-remainder:]
            tmp = tmp[-remainder:]
            np.subtract(view, mean, out=tmp)
            tmp *= tmp
            total_squares += tmp.sum()
    
        var = total_squares / len(arr)
        sd = var ** 0.5
        return sd
    
    a = np.arange(20e6)
    assert np.isclose(np.std(a), std(a))
    

    Timings show that the larger the blocksize, the closer the chunked version gets to np.std's speed, and the memory overhead is considerably lower (though I'm not sure the %memit figures below are 100% accurate).

    In [70]: %timeit np.std(a)
    10 loops, best of 3: 105 ms per loop
    
    In [71]: %timeit std(a, blocksize=4096)
    10 loops, best of 3: 160 ms per loop
    
    In [72]: %timeit std(a, blocksize=1000000)
    10 loops, best of 3: 105 ms per loop
    
    In [73]: %memit std(a, blocksize=4096)
    peak memory: 360.11 MiB, increment: 0.00 MiB
    
    In [74]: %memit std(a, blocksize=1000000)
    peak memory: 360.11 MiB, increment: 0.00 MiB
    
    In [75]: %memit np.std(a)
    peak memory: 512.70 MiB, increment: 152.59 MiB
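
    (%memit comes from the memory_profiler extension, loaded with %load_ext memory_profiler.)

    The std() above still assumes the whole array is available as a regular ndarray. If the raster values can be dumped to a raw binary file, the same function should run unchanged on a np.memmap, so only the pages currently being touched need to be resident; a sketch with a hypothetical file name and dtype:

    arr = np.memmap('raster_values.dat', dtype=np.float64, mode='r')  # maps the whole file as a 1-D array
    print(std(arr, blocksize=1000000))  # blocks are paged in lazily as they are read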
    
  • 2021-01-11 12:10

    Cython to the rescue! This achieves a nice speed up:

    %%cython
    cimport cython
    cimport numpy as np
    import numpy as np  # the Python-level import is needed for np.nan below
    from libc.math cimport sqrt
    
    @cython.boundscheck(False)
    def std_welford(np.ndarray[np.float64_t, ndim=1] a):
        cdef int n = 0
        cdef float mean = 0
        cdef float M2 = 0
        cdef int a_len = len(a)
        cdef int i
        cdef float delta
        cdef float result
        for i in range(a_len):
            n += 1
            delta = a[i] - mean
            mean += delta / n
            M2 += delta * (a[i] - mean)
        if n < 2:
            result = np.nan
            return result
        else:
            result = sqrt(M2 / (n - 1))
            return result
    

    Using this to test:

    a = np.random.rand(10000).astype(np.float64)
    print(std_welford(a))
    %timeit -n 10 -r 10 std_welford(a)
    

    Cython code

    0.288327455521
    10 loops, best of 10: 59.6 µs per loop
    

    Original code

    0.289605617397
    10 loops, best of 10: 18.5 ms per loop
    

    Numpy std

    0.289493223504
    10 loops, best of 10: 29.3 µs per loop
    

    So that is a speed increase of around 300x over the original pure-Python code, though still not quite as fast as the numpy version. (Note that the Cython function uses single-precision floats for its accumulators and divides by n - 1, which is why its printed result differs slightly from np.std, whose default is ddof=0.)
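
    For reference, the "Original code" timed above is presumably a plain-Python Welford loop; a sketch of such a baseline (not necessarily the exact code that was timed):

    import math

    def std_welford_py(a):
        """Pure-Python Welford recurrence; sample std (ddof=1), NaN for n < 2."""
        n = 0
        mean = 0.0
        M2 = 0.0
        for x in a:
            n += 1
            delta = x - mean
            mean += delta / n
            M2 += delta * (x - mean)
        if n < 2:
            return float('nan')
        return math.sqrt(M2 / (n - 1))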
