NumPy: calculate averages with NaNs removed

慢半拍i 2020-11-27 18:45

How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R users, think na.rm = TRUE.)

Here

12 answers
  • 2020-11-27 19:26

    If performance matters, you should use bottleneck.nanmean() instead:

    http://pypi.python.org/pypi/Bottleneck
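
As a rough illustration of the call, the sketch below uses bottleneck.nanmean when the package is installed and otherwise falls back to numpy.nanmean (available in NumPy 1.8 and later). The sample array `dat` and the helper name `row_nanmean` are made up for this example.

```python
import numpy as np

try:
    import bottleneck as bn  # optional third-party dependency
    _nanmean = bn.nanmean
except ImportError:
    _nanmean = np.nanmean  # fallback, needs NumPy >= 1.8


def row_nanmean(a):
    """Mean of each row, ignoring NaNs (hypothetical helper)."""
    return _nanmean(a, axis=1)


# Sample data (made up); no row is entirely NaN
dat = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, np.nan],
                [np.nan, 6.0, np.nan]])
result = row_nanmean(dat)
print(result)
```

Both functions take the array and an `axis` argument, so swapping one for the other should not require changes at the call site.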

  • 2020-11-27 19:26

    Alternatively, you can use laxarray, freshly uploaded, which is, among other things, a wrapper for masked arrays.

    import laxarray as la
    la.array(dat).mean(axis=1)
    

    Following JoshAdel's protocol, I get:

    Time: 0.048791  Ratio: 1.000000   
    Time: 0.062242  Ratio: 1.275689   # laxarray's one-liner
    

    So laxarray is marginally slower (I would need to check why; it may be fixable), but it is much easier to use and allows labelling dimensions with strings.

    Check out: https://github.com/perrette/laxarray

    EDIT: I have checked another module, "la" (larry), which beats all the other methods tested:

    import la
    la.larry(dat).mean(axis=1)
    
    By hand, Time: 0.049013 Ratio: 1.000000
    Larry,   Time: 0.005467 Ratio: 0.111540
    laxarray Time: 0.061751 Ratio: 1.259889
    

    Impressive!

  • 2020-11-27 19:29

    I think what you want is a masked array:

    import numpy as np

    dat = np.array([[1,2,3], [4,5,np.nan], [np.nan,6,np.nan], [np.nan,np.nan,np.nan]])
    mdat = np.ma.masked_array(dat,np.isnan(dat))  # mask every NaN entry
    mm = np.mean(mdat,axis=1)
    print(mm.filled(np.nan)) # the desired answer
    

    Edit: Combining all of the timing data

    from timeit import Timer

    setupstr = """
    import numpy as np
    # Later SciPy releases removed nanmean; there, use: from numpy import nanmean (NumPy 1.8+)
    from scipy.stats.stats import nanmean
    dat = np.random.normal(size=(1000,1000))
    ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
    dat[ii] = np.nan
    """

    method1 = """
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan)
    """

    N = 2
    # t1: masked array built from np.isnan (the baseline for the ratios)
    t1 = Timer(method1, setupstr).timeit(N)
    # t2: pure-Python filtering, element by element
    t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
    # t3: boolean indexing, row by row
    t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
    # t4: masked array built with masked_invalid
    t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
    # t5: nanmean as imported in setupstr
    t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)

    print('Time: %f\tRatio: %f' % (t1, t1/t1))
    print('Time: %f\tRatio: %f' % (t2, t2/t1))
    print('Time: %f\tRatio: %f' % (t3, t3/t1))
    print('Time: %f\tRatio: %f' % (t4, t4/t1))
    print('Time: %f\tRatio: %f' % (t5, t5/t1))
    

    Returns:

    Time: 0.045454  Ratio: 1.000000
    Time: 8.179479  Ratio: 179.950595
    Time: 0.060988  Ratio: 1.341755
    Time: 0.070955  Ratio: 1.561029
    Time: 0.065152  Ratio: 1.433364
    
  • 2020-11-27 19:31

    You can always find a workaround in something like:

    numpy.nansum(dat, axis=1) / numpy.sum(numpy.isfinite(dat), axis=1)
    

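    A skipna option for numpy.mean was expected in a future NumPy release when this was written, but it never shipped. NumPy 1.8 and later provide numpy.nanmean, which takes care of that.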
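
A minimal sketch of numpy.nanmean (available since NumPy 1.8), checked against the nansum-based workaround above. The sample array is made up, and the count here uses ~np.isnan instead of np.isfinite, a substitution so that only NaNs are excluded and infinities still count as values.

```python
import numpy as np

# Sample data (made up); no row is entirely NaN
dat = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, np.nan],
                [np.nan, 6.0, np.nan]])

# Built-in NaN-aware mean (NumPy >= 1.8)
builtin = np.nanmean(dat, axis=1)

# Manual equivalent: sum of the non-NaN values divided by their count
counts = np.sum(~np.isnan(dat), axis=1)
manual = np.nansum(dat, axis=1) / counts

print(builtin)
print(manual)
```

For a row that is entirely NaN, np.nanmean returns NaN and emits a RuntimeWarning, while the manual version divides 0 by 0 and also ends up with NaN.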

  • 2020-11-27 19:33

    A masked array with the nans filtered out can also be created on the fly:

    print(np.ma.masked_invalid(dat).mean(1))
    
  • 2020-11-27 19:34

    This is built upon the solution suggested by JoshAdel.

    Define the following function:

    import numpy

    def nanmean(data, **args):
        return numpy.ma.filled(numpy.ma.masked_array(data,numpy.isnan(data)).mean(**args), fill_value=numpy.nan)
    

    Example use:

    data = [[0, 1, numpy.nan], [8, 5, 1]]
    data = numpy.array(data)
    print(data)
    print(nanmean(data))
    print(nanmean(data, axis=0))
    print(nanmean(data, axis=1))
    

    Will print out:

    [[  0.   1.  nan]
     [  8.   5.   1.]]
    
    3.0
    
    [ 4.  3.  1.]
    
    [ 0.5         4.66666667]
    