Python: Sliding windowed mean, ignoring missing data

抹茶落季 2020-12-19 13:13

I am currently processing an experimental time-series dataset that has missing values. I would like to calculate the sliding-window mean of this dataset along the time axis, ignoring the missing values.

2 answers
  •  囚心锁ツ
    2020-12-19 13:58

    Here's a convolution-based approach using np.convolve: sum the valid samples in each window, then divide by the number of valid samples in that window -

    mask = np.isnan(data)
    K = np.ones(win_size,dtype=int)
    out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
    

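    Please note that np.convolve defaults to mode='full', so the output has win_size - 1 more elements than the input. For win_size = 3 that is one extra element on either side.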
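
    One way to avoid trimming the result by hand is mode='same', which returns an output aligned with the input. The sketch below is only an illustration under stated assumptions: win_size is odd and no larger than the data length, and the name nan_window_mean is hypothetical, not part of the original answer. It also leaves NaN wherever a window holds no valid samples, without triggering the divide warning.

```python
import numpy as np

def nan_window_mean(data, win_size):
    """Centred sliding mean that skips NaNs (hypothetical helper, odd win_size assumed)."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    kernel = np.ones(win_size)
    # Sum of the valid samples in each window (NaNs contribute 0).
    sums = np.convolve(np.where(mask, 0.0, data), kernel, mode="same")
    # Number of valid samples in each window.
    counts = np.convolve((~mask).astype(float), kernel, mode="same")
    out = np.full(data.shape, np.nan)
    # Divide only where at least one valid sample exists; the rest stays NaN.
    np.divide(sums, counts, out=out, where=counts > 0.5)
    return out

example = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
print(nan_window_mean(example, 3))
```

    Windows at the array edges are simply shorter, which matches the truncation done by the loop-based version further down.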

    If you are working with 2D data, the same idea works with SciPy's 2D convolution (scipy.signal.convolve2d).
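
    A minimal sketch of the 2D case, assuming scipy.signal.convolve2d with mode='same' and a window with odd side lengths. The function name nan_window_mean_2d and the sample grid are hypothetical and only illustrate the idea.

```python
import numpy as np
from scipy.signal import convolve2d

def nan_window_mean_2d(data, win_shape):
    """Centred 2D sliding mean that skips NaNs (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    mask = np.isnan(data)
    kernel = np.ones(win_shape)
    # Windowed sum of valid samples and windowed count of valid samples.
    sums = convolve2d(np.where(mask, 0.0, data), kernel, mode="same")
    counts = convolve2d((~mask).astype(float), kernel, mode="same")
    out = np.full(data.shape, np.nan)
    # Windows without any valid sample keep NaN.
    np.divide(sums, counts, out=out, where=counts > 0.5)
    return out

grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])
smoothed = nan_window_mean_2d(grid, (3, 3))
print(smoothed)
```

    Here the centre cell is filled with the mean of its eight valid neighbours, since the missing value itself is excluded from both the sum and the count.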

    Approaches -

    def original_app(data, win_size):
        # Loop-based reference: centred window, truncated at the array edges
        result = np.zeros(data.size)
        for count in range(data.size):
            # Integer division keeps the slice bounds ints on Python 3
            part_data = data[max(count - (win_size - 1) // 2, 0):
                             min(count + (win_size + 1) // 2, data.size)]
            mask = np.isfinite(part_data)
            if np.sum(mask) != 0:
                result[count] = np.sum(part_data[mask]) / np.sum(mask)
            else:
                result[count] = np.nan  # no valid samples in this window
        return result
    
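    def numpy_app(data, win_size):
        mask = np.isnan(data)
        K = np.ones(win_size, dtype=int)
        # Windowed sum of valid samples divided by windowed count of valid samples
        out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
        # Trim the (win_size - 1) // 2 extra elems per side (assumes odd win_size)
        pad = (win_size - 1) // 2
        return out[pad:pad + data.size]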
    

    Sample run -

    In [118]: #Construct sample data
         ...: n = 50
         ...: n_miss = 20
         ...: win_size = 3
         ...: data= np.random.random(50)
         ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
         ...: 
    
    In [119]: original_app(data, win_size = 3)
    Out[119]: 
    array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
                   nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
            0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
            0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
            0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
            0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
            0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
            0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
            0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
            0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
    
    In [120]: numpy_app(data, win_size = 3)
    __main__:36: RuntimeWarning: invalid value encountered in divide
    Out[120]: 
    array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
                   nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
            0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
            0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
            0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
            0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
            0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
            0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
            0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
            0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
    

    Runtime test -

    In [122]: #Construct sample data
         ...: n = 50000
         ...: n_miss = 20000
         ...: win_size = 3
         ...: data= np.random.random(n)
         ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
         ...: 
    
    In [123]: %timeit original_app(data, win_size = 3)
    1 loops, best of 3: 1.51 s per loop
    
    In [124]: %timeit numpy_app(data, win_size = 3)
    1000 loops, best of 3: 1.09 ms per loop
    
    In [125]: import pandas as pd
    
    # @jdehesa's pandas solution
    In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
    100 loops, best of 3: 3.34 ms per loop
    
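
    Note that rolling(window=3, min_periods=1) uses a trailing window by default, so its values are shifted relative to the centred window used by original_app and numpy_app. A sketch of a centred pandas variant, assuming the center=True option and an odd window. The name pandas_app is hypothetical and this variant was not part of the timings above.

```python
import numpy as np
import pandas as pd

def pandas_app(data, win_size):
    """Centred NaN-skipping rolling mean via pandas (hypothetical helper)."""
    # center=True aligns each window on the current sample;
    # min_periods=1 yields a mean whenever at least one value is not NaN.
    return (pd.Series(data)
              .rolling(window=win_size, center=True, min_periods=1)
              .mean()
              .to_numpy())

sample = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
centred = pandas_app(sample, 3)
print(centred)
```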
