Getting the last non-NaN value across rows in a pandas DataFrame

北海茫月 2020-12-11 05:22

I have a DataFrame of shape (40, 500). Each row holds numerical values up to some column k, which varies from row to row, and every entry after that is NaN.

3 Answers
  • 2020-12-11 05:35

    Here's a NumPy-based solution:

    In [113]: a
    Out[113]: 
    array([[ 17.,  53.,  nan,  63.,  66.,  nan,  nan,  nan,  nan,  nan],
           [ 54.,  96.,  71.,  20.,  70.,  58.,  91.,  nan,  nan,  nan],
           [ 58.,  26.,  72.,  93.,  58.,  29.,  44.,  28.,  36.,  88.],
           [ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
           [ 94.,  23.,  nan,  nan,  92.,  81.,  40.,  30.,  84.,  nan]])
    
    In [114]: m = ~np.isnan(a)
    
    In [115]: a[np.arange(m.shape[0]), m.shape[1]-m[:,::-1].argmax(1)-1]
    Out[115]: array([ 66.,  91.,  88.,  nan,  84.])
    

    To port this to a DataFrame, first extract the values as an array with a = df.values, then build the output DataFrame from the result:

    a = df.values                # underlying float array
    m = ~np.isnan(a)             # True where a value is present
    vals = a[np.arange(m.shape[0]), m.shape[1]-m[:,::-1].argmax(1)-1]
    df_out = pd.DataFrame(vals,index=df.index)
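
    The steps above can be put together as a small self-contained sketch. The function name last_valid_per_row and the sample data are illustrative assumptions, not part of the original answer; the indexing logic is the same as in the session shown earlier.

```python
import numpy as np
import pandas as pd

def last_valid_per_row(df):
    """Return the last non-NaN value of each row as a one-column DataFrame."""
    a = df.to_numpy(dtype=float)
    m = ~np.isnan(a)
    # Flip the columns and take the first True per row; converting that back
    # gives the position of the last non-NaN entry. For an all-NaN row,
    # argmax returns 0, so the last column (itself NaN) is selected.
    last_col = m.shape[1] - m[:, ::-1].argmax(axis=1) - 1
    vals = a[np.arange(m.shape[0]), last_col]
    return pd.DataFrame(vals, index=df.index)

# Hypothetical sample: NaNs in the middle of a row and one all-NaN row.
df = pd.DataFrame(
    [[17, 53, np.nan, 63, 66, np.nan],
     [np.nan] * 6,
     [94, 23, np.nan, np.nan, 92, 81]],
    index=list("abc"),
)
df_out = last_valid_per_row(df)
```

    Because everything happens on the underlying array, this avoids a Python-level loop over rows, which is likely to matter more as the number of rows grows.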
    
  • 2020-12-11 05:48

    Use agg('last'):

    # axis=1 groups the columns; note that groupby(axis=1) is deprecated since pandas 2.1
    df.groupby(['status'] * df.shape[1], axis=1).agg('last')
    


    Inside agg, 'last' returns the last valid (non-null) value within each group. I passed a list whose length equals the number of columns, with every element set to 'status', so all columns fall into a single group. The result is a DataFrame with one column named 'status'.
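
    As a hedged variant for pandas versions where grouping along axis=1 is deprecated or unavailable, the same single-group idea can be applied to the transposed frame. This is a sketch, not the answer's original code; the sample data and the names labels and out are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a partly filled row and an all-NaN row.
df = pd.DataFrame(
    {1: [7.080, np.nan, 7.066],
     2: [7.079, np.nan, np.nan],
     3: [np.nan, np.nan, np.nan]},
    index=["2016-06-02", "2016-06-14", "2016-06-15"],
)

# One label per original column, all identical, so every column
# ends up in the single group 'status'.
labels = ["status"] * df.shape[1]

# Group the rows of the transpose instead of the columns of df.
# 'last' skips NaN within the group; transposing back yields one
# column named 'status' indexed like df.
out = df.T.groupby(labels).agg("last").T
```

    The all-NaN row stays NaN in the result, since there is no valid value for 'last' to pick.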

  • 2020-12-11 05:50

    You need last_valid_index wrapped in a custom function, because it returns None when all values in a row are NaN, and indexing with that result directly raises a KeyError:

    def f(x):
        if x.last_valid_index() is None:
            return np.nan
        else:
            return x[x.last_valid_index()]
    
    df['status'] = df.apply(f, axis=1)
    print(df)
                    1      2      3      4      5      6      7      8      9  \
    0                                                                           
    2016-06-02  7.080  7.079  7.079  7.079  7.079  7.079    NaN    NaN    NaN   
    2016-06-08  7.053  7.053  7.053  7.053  7.053  7.054    NaN    NaN    NaN   
    2016-06-09  7.061  7.061  7.060  7.060  7.060  7.060    NaN    NaN    NaN   
    2016-06-14    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
    2016-06-15  7.066  7.066  7.066  7.066    NaN    NaN    NaN    NaN    NaN   
    2016-06-16  7.067  7.067  7.067  7.067  7.067  7.067  7.068  7.068    NaN   
    2016-06-21  7.053  7.053  7.052    NaN    NaN    NaN    NaN    NaN    NaN   
    2016-06-22  7.049  7.049    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
    2016-06-28  7.058  7.058  7.059  7.059  7.059  7.059  7.059  7.059  7.059   
    
                status  
    0                   
    2016-06-02   7.079  
    2016-06-08   7.054  
    2016-06-09   7.060  
    2016-06-14     NaN  
    2016-06-15   7.066  
    2016-06-16   7.068  
    2016-06-21   7.052  
    2016-06-22   7.049  
    2016-06-28   7.059  
    

    Alternative solution: forward-fill along each row with ffill(axis=1), then select the last column with iloc:

    df['status'] = df.ffill(axis=1).iloc[:, -1]
    print(df)
                status  
    0                   
    2016-06-02   7.079  
    2016-06-08   7.054  
    2016-06-09   7.060  
    2016-06-14     NaN  
    2016-06-15   7.066  
    2016-06-16   7.068  
    2016-06-21   7.052  
    2016-06-22   7.049  
    2016-06-28   7.059  
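
    To see that the two approaches in this answer agree, here is a minimal sketch on a small hypothetical frame. The helper name last_valid and the sample values are assumptions for illustration; it uses .loc for the lookup so that integer column labels are treated as labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: values followed by NaN, plus one all-NaN row.
df = pd.DataFrame(
    [[7.080, 7.079, np.nan],
     [np.nan, np.nan, np.nan],
     [7.053, np.nan, np.nan]],
    index=["2016-06-02", "2016-06-14", "2016-06-21"],
    columns=[1, 2, 3],
)

def last_valid(row):
    # last_valid_index() returns None when the whole row is NaN.
    idx = row.last_valid_index()
    return np.nan if idx is None else row.loc[idx]

# Row-wise apply versus forward fill along the columns.
via_apply = df.apply(last_valid, axis=1)
via_ffill = df.ffill(axis=1).iloc[:, -1]
```

    The ffill version stays within vectorized pandas operations, while the apply version calls a Python function once per row, so ffill is likely the better default for wide or long frames.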
    