First non-null value per row from a list of Pandas columns

难免孤独 2020-11-27 19:23

If I've got a DataFrame in pandas which looks something like:

        A   B   C
    0   1 NaN   2
    1 NaN   3 NaN
    2 NaN   4   5
    3 NaN NaN NaN

How can I extract the first non-null value from each row? For the example above, the desired result would be 1, 3, 4, and NaN (for the last, completely empty row).
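For reference, a snippet that reconstructs this example frame (values taken straight from the table above):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan],
                       'B': [np.nan, 3, 4, np.nan],
                       'C': [2, np.nan, 5, np.nan]})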

9 answers
  • 2020-11-27 19:54

    I'm going to weigh in here, as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:

    import numpy as np

    def get_first_non_null(df):
        a = df.values
        # index of the first non-NaN entry in each row
        col_index = np.isnan(a).argmin(axis=1)
        return [a[row, col] for row, col in enumerate(col_index)]
    
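    The mechanics on the example frame from the question look like this (np.isnan flags the NaNs, argmin picks the first False, i.e. first non-NaN, column in each row):

    np.isnan(df.values)
    # array([[False,  True, False],
    #        [ True, False,  True],
    #        [ True, False, False],
    #        [ True,  True,  True]])
    np.isnan(df.values).argmin(axis=1)
    # array([0, 1, 1, 0])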

    EDIT: Here's a fully vectorized solution, which can be a good deal faster again depending on the shape of the input. Updated benchmarking below.

    def get_first_non_null_vec(df):
        a = df.values
        n_rows, n_cols = a.shape
        # column of the first non-NaN entry in each row
        col_index = np.isnan(a).argmin(axis=1)
        # positions of those entries in the flattened array
        flat_index = n_cols * np.arange(n_rows) + col_index
        return a.ravel()[flat_index]
    
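    A quick sanity check on the example frame from the question (an all-NaN row comes back as NaN):

    get_first_non_null_vec(df)
    # array([ 1.,  3.,  4., nan])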

    If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:

    df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
    %timeit df.stack().groupby(level=0).first().reindex(df.index)
    %timeit get_first_non_null(df)
    %timeit get_first_non_null_vec(df)
    1 loops, best of 3: 220 ms per loop
    100 loops, best of 3: 16.2 ms per loop
    100 loops, best of 3: 12.6 ms per loop
    
    
    df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
    %timeit df.stack().groupby(level=0).first().reindex(df.index)
    %timeit get_first_non_null(df)
    %timeit get_first_non_null_vec(df)
    1 loops, best of 3: 246 ms per loop
    10 loops, best of 3: 48.2 ms per loop
    100 loops, best of 3: 15.7 ms per loop
    
    
    df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
    %timeit df.stack().groupby(level=0).first().reindex(df.index)
    %timeit get_first_non_null(df)
    %timeit get_first_non_null_vec(df)
    1 loops, best of 3: 326 ms per loop
    1 loops, best of 3: 326 ms per loop
    10 loops, best of 3: 35.7 ms per loop
    
  • 2020-11-27 19:57

    This is a really messy way to do this: first use first_valid_index to get the first valid column for each row, convert the returned Series to a DataFrame so we can call apply row-wise, and use that to index back into the original df:

    In [160]:
    def func(x):
        # x.values[0] is the first valid column label for this row (or None)
        if x.values[0] is None:
            return None
        else:
            return df.loc[x.name, x.values[0]]
    pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func, axis=1)
    Out[160]:
    0     1
    1     3
    2     4
    3   NaN
    dtype: float64
    

    EDIT

    A slightly cleaner way:

    In [12]:
    def func(x):
        if x.first_valid_index() is None:
            return None
        else:
            return x[x.first_valid_index()]
    df.apply(func, axis=1)
    
    Out[12]:
    0     1
    1     3
    2     4
    3   NaN
    dtype: float64
    
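    A minor variant (my sketch, not from the original answer) that avoids calling first_valid_index twice per row:

    def func(x):
        # look up the first valid label once, then use it if present
        idx = x.first_valid_index()
        return x[idx] if idx is not None else None

    df.apply(func, axis=1)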
  • 2020-11-27 19:57

    Backfill the NaNs along each row with fillna (each NaN takes the next valid value to its right), then take the leftmost column:

    df.fillna(method='bfill', axis=1).iloc[:, 0]
    
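    Note: on recent pandas (2.x), the method= argument to fillna is deprecated; if I'm not mistaken, the equivalent modern spelling is:

    df.bfill(axis=1).iloc[:, 0]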
  • 2020-11-27 20:00

    JoeCondron's answer (EDIT: before his last edit!) is cool, but there is room for a significant improvement by avoiding the non-vectorized enumeration:

    import numpy as np

    def get_first_non_null_vect(df):
        a = df.values
        col_index = np.isnan(a).argmin(axis=1)
        # fancy indexing: take element (i, col_index[i]) from each row i
        return a[np.arange(a.shape[0]), col_index]
    
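    Here is the fancy-indexing trick in isolation (a minimal illustration with made-up numbers):

    import numpy as np

    a = np.array([[10, 11],
                  [20, 21],
                  [30, 31]])
    a[np.arange(3), [1, 0, 1]]  # picks a[0, 1], a[1, 0], a[2, 1]
    # array([11, 20, 31])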

    The improvement is small if the DataFrame is relatively flat (i.e. wide, with many columns):

    In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
    
    In [5]: %timeit get_first_non_null(df)
    10 loops, best of 3: 34.9 ms per loop
    
    In [6]: %timeit get_first_non_null_vect(df)
    10 loops, best of 3: 31.6 ms per loop
    

    ... but can be relevant on slim DataFrames:

    In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
    
    In [8]: %timeit get_first_non_null(df)
    100 loops, best of 3: 3.75 ms per loop
    
    In [9]: %timeit get_first_non_null_vect(df)
    1000 loops, best of 3: 718 µs per loop
    

    Compared to JoeCondron's vectorized version, the runtime is very similar (this one is still slightly quicker for slim DataFrames, and slightly slower for wide ones).

  • 2020-11-27 20:02

    Here is another way to do it:

    In [183]: df.stack().groupby(level=0).first().reindex(df.index)
    Out[183]: 
    0     1
    1     3
    2     4
    3   NaN
    dtype: float64
    

    The idea here is to use stack to move the columns into a row index level:

    In [184]: df.stack()
    Out[184]: 
    0  A    1
       C    2
    1  B    3
    2  B    4
       C    5
    dtype: float64
    

    Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:

    In [185]: df.stack().groupby(level=0).first()
    Out[185]: 
    0    1
    1    3
    2    4
    dtype: float64
    

    All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:

    df.stack().groupby(level=0).first().reindex(df.index)
    
  • 2020-11-27 20:07

    Here is a one-line solution:

    [row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
    

    Edit:

    This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is then used as an index to get the first non-null item in the row.

    If there is no non-null value in the row, row.first_valid_index() returns None, which cannot be used as an index, hence the if-else with an explicit is-not-None check.

    I packed everything into a list comprehension for brevity.
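    Spelled out as an explicit loop, the same logic reads:

    result = []
    for _, row in df.iterrows():
        idx = row.first_valid_index()
        result.append(row[idx] if idx is not None else None)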
