First non-null value per row from a list of Pandas columns

前端未结

关注

 9  1177

If I\'ve got a DataFrame in pandas which looks something like:

    A   B   C
0   1 NaN   2
1 NaN   3 NaN
2 NaN   4   5
3 NaN NaN NaN

How ca

相关标签:

9条回答

春和景丽

2020-11-27 19:54

I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values but the look up is very quick:

def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]

EDIT: Here's a fully vectorized solution which is can be a good deal faster again depending on the shape of the input. Updated benchmarking below.

def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]

If a row is completely null then the corresponding value will be null also. Here's some benchmarking against unutbu's solution:

df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:


df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop


df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop

0 讨论(0)

闹比i

2020-11-27 19:57

This is a really messy way to do this, first use first_valid_index to get the valid columns, convert the returned series to a dataframe so we can call apply row-wise and use this to index back to original df:

In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)

Out[160]:
0     1
1     3
2     4
3   NaN
dtype: float64

EDIT

A slightly cleaner way:

In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)

Out[12]:
0     1
1     3
2     4
3   NaN
dtype: float64

0 讨论(0)

庸人自扰

2020-11-27 19:57
Fill the nans from the left with fillna, then get the leftmost column:
```
df.fillna(method='bfill', axis=1).iloc[:, 0]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

逝去的感伤

2020-11-27 20:00

JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:

def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]

The improvement is small if the DataFrame is relatively flat:

In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))

In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop

In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop

... but can be relevant on slim DataFrames:

In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))

In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop

In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop

Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).

0 讨论(0)

灰色年华

2020-11-27 20:02
Here is another way to do it:
```
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]: 
0     1
1     3
2     4
3   NaN
dtype: float64
```
The idea here is to use stack to move the columns into a row index level:
```
In [184]: df.stack()
Out[184]: 
0  A    1
   C    2
1  B    3
2  B    4
   C    5
dtype: float64
```
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
```
In [185]: df.stack().groupby(level=0).first()
Out[185]: 
0    1
1    3
2    4
dtype: float64
```
All we need to do is reindex the result (using the original index) so as to include rows that are completely NaN:
```
df.stack().groupby(level=0).first().reindex(df.index)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
孤城傲影

2020-11-27 20:07
Here is a one line solution:
```
[row[row.first_valid_index()] if row.first_valid_index() else None for _, row in df.iterrows()]
```
Edit:

This solution iterates over rows of df. row.first_valid_index() returns label for first non-NA/null value, which will be used as index to get the first non-null item in each row.

If there is no non-null value in the row, row.first_valid_index() would be None, thus cannot be used as index, so I need a if-else statement.

I packed everything into a list comprehension for brevity.
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页