Python: Sliding windowed mean, ignoring missing data

抹茶落季 2020-12-19 13:13

I am currently trying to process an experimental timeseries dataset, which has missing values. I would like to calculate the sliding windowed mean of this dataset along time

2 Answers
  • 2020-12-19 13:58

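    Here's a convolution-based approach using np.convolve -

    import numpy as np

    mask = np.isnan(data)               # True where a sample is missing
    K = np.ones(win_size, dtype=int)    # box kernel of the window length
    # Window sums (NaNs counted as 0) divided by the number of valid samples per window
    out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    

    Please note that np.convolve defaults to mode='full', so out has win_size - 1 more elements than data: one extra element on each side for win_size = 3. Windows with no valid samples evaluate to 0/0, which gives nan and a RuntimeWarning.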
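As a minimal sketch of the same sum-over-count idea without the 0/0 warning: the helper name nan_window_mean is hypothetical (not from the answer), and the trimming assumes an odd win_size so that the window is centered on each sample.

```python
import numpy as np

def nan_window_mean(data, win_size):
    """Centered moving mean that skips NaNs (hypothetical helper, odd win_size assumed)."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    kernel = np.ones(win_size)
    # Sum of the valid samples in each window (NaNs replaced by 0)
    sums = np.convolve(np.where(mask, 0.0, data), kernel)
    # Number of valid samples in each window
    counts = np.convolve((~mask).astype(float), kernel)
    # Divide only where at least one sample is valid; other windows stay NaN
    out = np.full(sums.shape, np.nan)
    np.divide(sums, counts, out=out, where=counts > 0.5)
    # Trim the mode='full' output back to the input length
    half = (win_size - 1) // 2
    return out[half:half + data.size]

result = nan_window_mean([1.0, np.nan, 3.0, 4.0], 3)
print(result)  # values: 1.0, 2.0, 3.5, 3.5
```

Using np.divide with where= leaves windows that contain no valid samples as NaN instead of triggering the invalid-value warning.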

    If you are working with 2D data, we can use Scipy's 2D convolution.
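For the 2D case, a rough sketch with scipy.signal.convolve2d might look like the following. The function name nan_window_mean_2d and the zero-padded 'same' boundary handling are assumptions made for illustration; the answer itself does not spell out these details.

```python
import numpy as np
from scipy.signal import convolve2d

def nan_window_mean_2d(data, win_shape):
    """2D moving mean that skips NaNs (hypothetical helper, odd window sides assumed)."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    kernel = np.ones(win_shape)
    # mode='same' keeps the input shape; zero fill means padding adds nothing to sums or counts
    sums = convolve2d(np.where(mask, 0.0, data), kernel,
                      mode="same", boundary="fill", fillvalue=0)
    counts = convolve2d((~mask).astype(float), kernel,
                        mode="same", boundary="fill", fillvalue=0)
    # Windows without any valid sample stay NaN
    out = np.full(data.shape, np.nan)
    np.divide(sums, counts, out=out, where=counts > 0.5)
    return out

grid = np.array([[1.0, np.nan],
                 [3.0, 4.0]])
smoothed = nan_window_mean_2d(grid, (3, 3))
# Every 3x3 window covers the whole 2x2 grid, so each entry is (1 + 3 + 4) / 3
```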

    Approaches -

    def original_app(data, win_size):
        # Compute the centered windowed mean with an explicit loop
        result = np.zeros(data.size)
        for count in range(data.size):
            # Integer division keeps the slice bounds ints on Python 3
            part_data = data[max(count - (win_size - 1) // 2, 0):
                             min(count + (win_size + 1) // 2, data.size)]
            mask = np.isfinite(part_data)
            if np.sum(mask) != 0:
                result[count] = np.sum(part_data[mask]) / np.sum(mask)
            else:
                result[count] = np.nan  # no valid samples in this window
        return result
    
    def numpy_app(data, win_size):
        mask = np.isnan(data)
        K = np.ones(win_size, dtype=int)
        out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
        # Trim the 'full' output back to data.size (centered; assumes odd win_size)
        half = (win_size - 1) // 2
        return out[half:half + data.size]
    

    Sample run -

    In [118]: #Construct sample data
         ...: n = 50
         ...: n_miss = 20
         ...: win_size = 3
         ...: data= np.random.random(50)
         ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
         ...: 
    
    In [119]: original_app(data, win_size = 3)
    Out[119]: 
    array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
                   nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
            0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
            0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
            0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
            0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
            0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
            0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
            0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
            0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
    
    In [120]: numpy_app(data, win_size = 3)
    __main__:36: RuntimeWarning: invalid value encountered in divide
    Out[120]: 
    array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
                   nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
            0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
            0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
            0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
            0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
            0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
            0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
            0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
            0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
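Rather than comparing the two printed arrays by eye, the agreement can be checked programmatically. This is a sketch under assumptions: loop_mean and conv_mean are hypothetical Python 3 ports of the two approaches above, the window size is assumed odd, and the seeded generator is only there to make the run repeatable.

```python
import numpy as np

def loop_mean(data, win_size):
    # Explicit loop over centered windows, skipping non-finite samples
    half = (win_size - 1) // 2
    result = np.full(data.size, np.nan)
    for i in range(data.size):
        window = data[max(i - half, 0): i + half + 1]
        valid = np.isfinite(window)
        if valid.any():
            result[i] = window[valid].mean()
    return result

def conv_mean(data, win_size):
    # Window sums divided by window counts, both obtained by convolution
    mask = np.isnan(data)
    kernel = np.ones(win_size)
    with np.errstate(invalid="ignore", divide="ignore"):
        out = (np.convolve(np.where(mask, 0.0, data), kernel)
               / np.convolve((~mask).astype(float), kernel))
    half = (win_size - 1) // 2
    return out[half:half + data.size]

rng = np.random.default_rng(0)          # seeded so the check is repeatable
data = rng.random(50)
data[rng.integers(0, 50, 20)] = np.nan  # indices may repeat, so up to 20 NaNs

# equal_nan=True treats NaNs at matching positions as equal
same_w3 = np.allclose(loop_mean(data, 3), conv_mean(data, 3), equal_nan=True)
same_w5 = np.allclose(loop_mean(data, 5), conv_mean(data, 5), equal_nan=True)
print(same_w3, same_w5)
```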
    

    Runtime test -

    In [122]: #Construct sample data
         ...: n = 50000
         ...: n_miss = 20000
         ...: win_size = 3
         ...: data= np.random.random(n)
         ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
         ...: 
    
    In [123]: %timeit original_app(data, win_size = 3)
    1 loops, best of 3: 1.51 s per loop
    
    In [124]: %timeit numpy_app(data, win_size = 3)
    1000 loops, best of 3: 1.09 ms per loop
    
    In [125]: import pandas as pd
    
    # @jdehesa's pandas solution
    In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
    100 loops, best of 3: 3.34 ms per loop
    
  • 2020-12-19 14:17

    You can do that using the rolling function of Pandas:

    import numpy as np
    import pandas as pd
    
    #Construct sample data
    n = 50
    n_miss = 20
    win_size = 3
    data = np.random.random(n)
    data[np.random.randint(0, n-1, n_miss)] = np.nan
    
    windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1).mean()
    
    print(pd.DataFrame({'Data': data, 'Windowed mean': windowed_mean}))
    

    Output:

            Data  Windowed mean
    0   0.589376       0.589376
    1   0.639173       0.614274
    2   0.343534       0.524027
    3   0.250329       0.411012
    4   0.911952       0.501938
    5        NaN       0.581141
    6   0.224964       0.568458
    7        NaN       0.224964
    8   0.508419       0.366692
    9   0.215418       0.361918
    10       NaN       0.361918
    11  0.638118       0.426768
    12  0.587478       0.612798
    13  0.097037       0.440878
    14  0.688689       0.457735
    15  0.858593       0.548107
    16  0.408903       0.652062
    17  0.448993       0.572163
    18       NaN       0.428948
    19  0.877453       0.663223
    20       NaN       0.877453
    21       NaN       0.877453
    22  0.021798       0.021798
    23  0.482054       0.251926
    24  0.092387       0.198746
    25  0.251766       0.275402
    26  0.093854       0.146002
    27       NaN       0.172810
    28       NaN       0.093854
    29       NaN            NaN
    30  0.965669       0.965669
    31  0.695999       0.830834
    32       NaN       0.830834
    33       NaN       0.695999
    34       NaN            NaN
    35  0.613727       0.613727
    36  0.837533       0.725630
    37       NaN       0.725630
    38  0.782295       0.809914
    39       NaN       0.782295
    40  0.777429       0.779862
    41  0.401355       0.589392
    42  0.491709       0.556831
    43  0.127813       0.340292
    44  0.781625       0.467049
    45  0.960466       0.623301
    46  0.637618       0.793236
    47  0.651264       0.749782
    48  0.154911       0.481264
    49  0.159145       0.321773
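One thing worth noting about the output above: rolling uses a trailing window by default, so each row averages the current value and the two before it. If a window centered on each sample is wanted instead, passing center=True should give that. A small sketch on made-up values (the array below is illustrative, not data from the question):

```python
import numpy as np
import pandas as pd

# Illustrative values only, chosen so the windows are easy to check by hand
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0])

# Default: window covers positions i-2 .. i
trailing = s.rolling(window=3, min_periods=1).mean()
# center=True: window covers positions i-1 .. i+1
centered = s.rolling(window=3, min_periods=1, center=True).mean()

print(pd.DataFrame({'Data': s, 'Trailing': trailing, 'Centered': centered}))
```

With min_periods=1 a window yields NaN only when every value in it is missing, which is why index 6 is NaN in the trailing column and index 5 is NaN in the centered one.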
    