Most efficient way to forward-fill NaN values in numpy array

南笙 2020-11-28 04:25

Example Problem

As a simple example, consider the numpy array arr as defined below:

import numpy as np
arr = np.array([[5, np.nan, np.         


        
5 Answers
  • 2020-11-28 05:03

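    Here's one approach: record the column index of every non-NaN entry, take a running maximum along each row so that every position holds the index of the last valid value seen so far, then index into arr with the result.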

    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    

    To fill the NaNs in arr in place instead of creating another array, replace the last step with:

    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
    

    Sample input, output -

    In [179]: arr
    Out[179]: 
    array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
           [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
           [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])
    
    In [180]: out
    Out[180]: 
    array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
           [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
           [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
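
To make the indexing trick easier to follow, here is a small trace of the intermediate idx array, using the sample input shown above. It is only an illustrative walk-through of the same steps, not a different method.

```python
import numpy as np

# Sample input copied from the answer above
arr = np.array([[5, np.nan, np.nan, 7, 2, 6, 5],
                [3, np.nan, 1, 8, np.nan, 5, np.nan],
                [4, 9, 6, np.nan, np.nan, np.nan, 7]])

mask = np.isnan(arr)

# Column index where a value is present, 0 where it is NaN
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
print(idx)
# [[0 0 0 3 4 5 6]
#  [0 0 2 3 0 5 0]
#  [0 1 2 0 0 0 6]]

# Running maximum along each row: every position now holds the
# column of the most recent non-NaN value
np.maximum.accumulate(idx, axis=1, out=idx)
print(idx)
# [[0 0 0 3 4 5 6]
#  [0 0 2 3 3 5 5]
#  [0 1 2 2 2 2 6]]

# Gather the values at those columns, row by row
out = arr[np.arange(idx.shape[0])[:, None], idx]
```

Note that a row starting with NaN keeps its leading NaNs, because index 0 then points at the NaN itself.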
    
  • 2020-11-28 05:04

    Use Numba to JIT-compile an explicit loop. For large arrays this should give a significant speedup over a plain Python loop:

    import numba
    @numba.jit
    def loops_fill(arr):
        ...
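
The body of loops_fill is elided in this answer. As a rough sketch of what such a loop might look like (an assumption, not the answerer's actual code), the version below forward-fills each row of a 2-D float array and returns a copy. The try/except fallback is also an addition so that the example still runs, uncompiled, when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Numba is optional here: fall back to running the loop as plain Python
    def njit(func):
        return func


@njit
def loops_fill(arr):
    # Hypothetical loop body: copy the input, then walk each row left to right
    out = arr.copy()
    for row in range(out.shape[0]):
        for col in range(1, out.shape[1]):
            if np.isnan(out[row, col]):
                # Reuse the value to the left (which is already filled)
                out[row, col] = out[row, col - 1]
    return out


sample = np.array([[5, np.nan, np.nan, 7, 2, 6, 5],
                   [3, np.nan, 1, 8, np.nan, 5, np.nan],
                   [4, 9, 6, np.nan, np.nan, np.nan, 7]])
print(loops_fill(sample))
```

Whether this beats the vectorized approach depends on array size and shape, so it is worth timing on your own data.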
    
  • 2020-11-28 05:13

    If some rows start with NaN, those leading NaNs are still there after forward-filling. The following replaces them with the first non-NaN value of each row:

    mask = np.isnan(arr)
    # Column index of the first non-NaN value in each row
    first_valid_idx = (~mask).argmax(axis=1)
    # Prepend each row's first valid value once per leading NaN
    arr = np.array([np.hstack([
                 [arr[i, first_valid]] * first_valid,
                 arr[i, first_valid:]])
                 for i, first_valid in enumerate(first_valid_idx)])
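
The list comprehension above loops over rows in Python. A vectorized variant is sketched below, under the assumption that arr is a 2-D float array. The function name fill_leading_nans and the demo data are made up for illustration.

```python
import numpy as np


def fill_leading_nans(arr):
    """Replace the leading NaNs of each row with that row's first non-NaN value."""
    mask = np.isnan(arr)
    # Column of the first non-NaN value per row (0 if the row is all NaN)
    first_valid = (~mask).argmax(axis=1)
    # Columns left of first_valid are redirected to first_valid; others map to themselves
    cols = np.maximum(np.arange(arr.shape[1]), first_valid[:, None])
    return arr[np.arange(arr.shape[0])[:, None], cols]


# Illustrative data, not from the original question
demo = np.array([[np.nan, np.nan, 1.0, 2.0],
                 [3.0, np.nan, 4.0, 5.0]])
print(fill_leading_nans(demo))
# [[ 1.  1.  1.  2.]
#  [ 3. nan  4.  5.]]
```

Interior NaNs are left untouched, so this is meant to be combined with a forward fill rather than to replace it.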
    
  • 2020-11-28 05:18

    For those who came here looking for a backward-fill of NaN values, I modified Divakar's solution above to do exactly that. The trick is to run the accumulation on the reversed array, using the minimum instead of the maximum, with NaN positions initialized to the last column index rather than 0.

    Here is the code:

    
    
    # As provided in the answer by Divakar
    def ffill(arr):
        mask = np.isnan(arr)
        idx = np.where(~mask, np.arange(mask.shape[1]), 0)
        np.maximum.accumulate(idx, axis=1, out=idx)
        out = arr[np.arange(idx.shape[0])[:,None], idx]
        return out
    
    # My modification to do a backward-fill
    def bfill(arr):
        mask = np.isnan(arr)
        idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
        idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
        out = arr[np.arange(idx.shape[0])[:,None], idx]
        return out
    
    
    # Test both functions
    arr = np.array([[5, np.nan, np.nan, 7, 2],
                    [3, np.nan, 1, 8, np.nan],
                    [4, 9, 6, np.nan, np.nan]])
    print('Array:')
    print(arr)
    
    print('\nffill')
    print(ffill(arr))
    
    print('\nbfill')
    print(bfill(arr))
    
    

    Output:

    Array:
    [[ 5. nan nan  7.  2.]
     [ 3. nan  1.  8. nan]
     [ 4.  9.  6. nan nan]]
    
    ffill
    [[5. 5. 5. 7. 2.]
     [3. 3. 1. 8. 8.]
     [4. 9. 6. 6. 6.]]
    
    bfill
    [[ 5.  7.  7.  7.  2.]
     [ 3.  1.  1.  8. nan]
     [ 4.  9.  6. nan nan]]
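
As a cross-check, a backward fill can also be expressed as a forward fill on the column-reversed array. The sketch below reuses the ffill function from this answer; the name bfill_via_flip is hypothetical.

```python
import numpy as np


def ffill(arr):
    # Forward fill along axis 1, as in the answer above
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def bfill_via_flip(arr):
    # Reverse the columns, forward-fill, then reverse back
    return ffill(arr[:, ::-1])[:, ::-1]


arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print(bfill_via_flip(arr))
```

For this test array it should print the same result as the bfill output shown above, with trailing NaNs left in place.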
    

    Edit: updated according to the comment by MS_.

  • 2020-11-28 05:18

    I liked Divakar's answer on pure numpy. Here's a generalized function for n-dimensional arrays:

    def np_ffill(arr, axis):
        idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
        idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
        np.maximum.accumulate(idx, axis=axis, out=idx)
        slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
            for dim in range(len(arr.shape))])]
            for i, k in enumerate(arr.shape)]
        slc[axis] = idx
        return arr[tuple(slc)]
    

    AFAIK pandas can only work with two dimensions, although a MultiIndex can make up for that to some extent. The only way to do this in pandas would be to flatten the data into a DataFrame, unstack the desired level, restack, and finally reshape back to the original shape. All that unstacking, restacking, and reshaping, plus the sorting pandas does along the way, is unnecessary overhead for the same result.
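
For comparison, the flatten-and-reshape idea can be done in NumPy alone: move the fill axis to the end, collapse the other axes into rows, apply the 2-D forward fill, and undo both steps. This is only a sketch of an alternative, and the names ffill_2d and ffill_any_axis are illustrative.

```python
import numpy as np


def ffill_2d(arr):
    # 2-D forward fill along axis 1 (Divakar's approach)
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def ffill_any_axis(arr, axis):
    # Put the fill axis last so that every "row" is one 1-D slice along it
    moved = np.moveaxis(arr, axis, -1)
    flat = moved.reshape(-1, moved.shape[-1])
    filled = ffill_2d(flat).reshape(moved.shape)
    # Restore the original axis order
    return np.moveaxis(filled, -1, axis)


ra = np.random.choice([1, 2, 3, 4, np.nan], size=(2, 4, 8))
print(ffill_any_axis(ra, 1))
```

Unlike the index-based np_ffill, this version may copy the data when reshaping, but it also accepts negative axis values because np.moveaxis handles them.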

    Testing:

    def random_array(shape):
        choices = [1, 2, 3, 4, np.nan]
        out = np.random.choice(choices, size=shape)
        return out
    
    ra = random_array((2, 4, 8))
    print('arr')
    print(ra)
    print('\nffill')
    print(np_ffill(ra, 1))
    

    Output:

    arr
    [[[ 3. nan  4.  1.  4.  2.  2.  3.]
      [ 2. nan  1.  3. nan  4.  4.  3.]
      [ 3.  2. nan  4. nan nan  3.  4.]
      [ 2.  2.  2. nan  1.  1. nan  2.]]
    
     [[ 2.  3.  2. nan  3.  3.  3.  3.]
      [ 3.  3.  1.  4.  1.  4.  1. nan]
      [ 4.  2. nan  4.  4.  3. nan  4.]
      [ 2.  4.  2.  1.  4.  1.  3. nan]]]
    
    ffill
    [[[ 3. nan  4.  1.  4.  2.  2.  3.]
      [ 2. nan  1.  3.  4.  4.  4.  3.]
      [ 3.  2.  1.  4.  4.  4.  3.  4.]
      [ 2.  2.  2.  4.  1.  1.  3.  2.]]
    
     [[ 2.  3.  2. nan  3.  3.  3.  3.]
      [ 3.  3.  1.  4.  1.  4.  1.  3.]
      [ 4.  2.  1.  4.  4.  3.  1.  4.]
      [ 2.  4.  2.  1.  4.  1.  3.  4.]]]
    