Question
Suppose I have the following dataframe:
pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
              'override1': ["b", np.nan, "b", np.nan, np.nan],
              'override2': ["c", np.nan, np.nan, "c", np.nan]})
  col1 override1 override2
0    a         b         c
1    a       NaN       NaN
2  NaN         b       NaN
3  NaN       NaN         c
4  NaN       NaN       NaN
Is there a way to collapse the 3 columns into one column, where override2 overrides override1, which in turn overrides col1, but where a NaN leaves the value before it in place? I would also prefer not to create an additional column, and I am really looking for a built-in pandas solution.
This is the output I am looking for:
collapsed
0 c
1 a
2 b
3 c
4 NaN
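Answer 1:
Using ffill (the axis is passed by keyword, since recent pandas versions no longer accept it positionally):
df.ffill(axis=1).iloc[:, -1]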
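As a self-contained sketch of the forward-fill idea on the question's frame: filling left to right pushes each row's last non-null value into the rightmost column, which is then selected. The Series name `collapsed` is taken from the desired output in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})

# Forward-fill along each row so the last non-null value ends up
# in the rightmost column, then keep only that column.
collapsed = df.ffill(axis=1).iloc[:, -1].rename('collapsed')
print(collapsed)
```

A row with no values at all stays NaN, because there is nothing to fill from.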
Answer 2:
Performance NOT in mind but rather beauty and elegance (-:
df.stack().groupby(level=0).last().reindex(df.index)
0 c
1 a
2 b
3 c
4 NaN
dtype: object
Answer 3:
A straightforward solution involves forward filling and picking off the last column. This was mentioned in the comments.
df.ffill(axis=1).iloc[:, -1].to_frame(name='collapsed')
collapsed
0 c
1 a
2 b
3 c
4 NaN
If you're interested in performance, we can use a modified version of Divakar's justify function:
pd.DataFrame({'collapsed': justify(
df.values, invalid_val=np.nan, axis=1, side='right')[:,-1]
})
collapsed
0 c
1 a
2 b
3 c
4 NaN
Reference.
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """
    if invalid_val is np.nan:
        mask = pd.notna(a)  # modified for strings
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Answer 4:
With a focus on performance, here's one with NumPy:
In [106]: idx = df.shape[1] - 1 - df.notnull().to_numpy()[:,::-1].argmax(1)
In [107]: pd.Series(df.to_numpy()[np.arange(len(df)),idx])
Out[107]:
0 c
1 a
2 b
3 c
4 NaN
dtype: object
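The two interactive lines above can be wrapped into a reusable function; this is a minimal sketch, and the name `last_valid_per_row` is hypothetical rather than part of the answer.

```python
import numpy as np
import pandas as pd

def last_valid_per_row(df):
    """Return a Series holding each row's last non-null value (NaN if none)."""
    values = df.to_numpy()
    valid = df.notna().to_numpy()
    # Reverse the columns so argmax finds the first True from the right,
    # then convert that back into a left-to-right column position.
    idx = df.shape[1] - 1 - valid[:, ::-1].argmax(axis=1)
    # For an all-null row argmax returns 0, so idx points at the last
    # column, which is itself null, and the result stays NaN.
    return pd.Series(values[np.arange(len(df)), idx], index=df.index)

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})
result = last_valid_per_row(df)
print(result)
```

Passing `index=df.index` keeps the original row labels, which the bare `pd.Series(...)` call would replace with a fresh default index.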
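Answer 5:
Here's one approach (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this line needs an older pandas):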
df.lookup(df.index , df.notna().cumsum(1).idxmax(1))
# array(['c', 'a', 'b', 'c', nan], dtype=object)
Or equivalently, working with the underlying numpy arrays and replacing idxmax with ndarray.argmax (this relies on the default RangeIndex, because df.index is used as positional row indices):
df.values[df.index, df.notna().cumsum(1).values.argmax(1)]
# array(['c', 'a', 'b', 'c', nan], dtype=object)
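For pandas versions where `DataFrame.lookup` is unavailable, the same label-based idea can be sketched with `Index.get_indexer` plus positional indexing. This is one possible substitute, not the answer's own code; the names `last_col` and `col_pos` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})

# Label of each row's last non-null column: the running count of
# non-null values first reaches its maximum at that column.
last_col = df.notna().cumsum(axis=1).idxmax(axis=1)

# Translate the column labels to integer positions and index the raw
# array row by row, which mirrors a label-based lookup.
col_pos = df.columns.get_indexer(last_col)
result = df.to_numpy()[np.arange(len(df)), col_pos]
print(result)
```

Using `np.arange(len(df))` for the rows avoids depending on the index being a default RangeIndex.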
Answer 6:
import pandas as pd
import numpy as np
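df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})
print(df)
# Note: adding the filled columns concatenates the non-null strings in each row,
# so row 0 becomes 'abc' and the all-NaN row becomes ''.
df = df['col1'].fillna('') + df['override1'].fillna('') + df['override2'].fillna('')
print(df)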
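If the override behaviour from the question is what is wanted, a chain of `fillna` calls starting from the highest-priority column is one option. This is a variant sketch under that assumption, not the answer's own code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})

# Start from the highest-priority column and fall back, row by row,
# to the next column only where the value is still missing.
collapsed = (df['override2']
             .fillna(df['override1'])
             .fillna(df['col1'])
             .rename('collapsed'))
print(collapsed)
```

This names each column explicitly, so it suits a small, fixed set of columns better than a wide frame.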
Source: https://stackoverflow.com/questions/56583174/how-to-collapse-columns-in-pandas-on-null-values