I have a DataFrame of shape (40, 500). Each row holds numerical values up to some column k, which varies from row to row, and every entry after that is NaN.
Here's a NumPy-based solution:
In [113]: a
Out[113]:
array([[ 17.,  53.,  nan,  63.,  66.,  nan,  nan,  nan,  nan,  nan],
       [ 54.,  96.,  71.,  20.,  70.,  58.,  91.,  nan,  nan,  nan],
       [ 58.,  26.,  72.,  93.,  58.,  29.,  44.,  28.,  36.,  88.],
       [ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [ 94.,  23.,  nan,  nan,  92.,  81.,  40.,  30.,  84.,  nan]])
In [114]: m = ~np.isnan(a)
In [115]: a[np.arange(m.shape[0]), m.shape[1]-m[:,::-1].argmax(1)-1]
Out[115]: array([ 66., 91., 88., nan, 84.])
To port this to a DataFrame, first extract the values as an array:
a = df.values
Then compute the mask m as above, and finally build the output DataFrame:
vals = a[np.arange(m.shape[0]), m.shape[1]-m[:,::-1].argmax(1)-1]
df_out = pd.DataFrame(vals,index=df.index)
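The steps above can be collected into one helper. This is only a sketch: the function name last_valid_per_row and the small test frame are made up for illustration, and it assumes every column is numeric so the frame converts cleanly to a float array.

```python
import numpy as np
import pandas as pd


def last_valid_per_row(df: pd.DataFrame) -> pd.Series:
    """Return the last non-NaN value in each row (NaN for all-NaN rows)."""
    a = df.to_numpy(dtype=float)   # assumption: all columns are numeric
    m = ~np.isnan(a)               # True where a value is present
    # Position of the first True in each reversed row, i.e. counted from the right.
    from_right = m[:, ::-1].argmax(axis=1)
    # Convert that to a column position counted from the left.
    last_col = m.shape[1] - from_right - 1
    # For an all-NaN row argmax returns 0, so the final column is picked,
    # and that entry is NaN anyway.
    vals = a[np.arange(m.shape[0]), last_col]
    return pd.Series(vals, index=df.index)


# Hypothetical sample data, including a gap in the middle and an all-NaN row.
df = pd.DataFrame(
    [[17.0, 53.0, np.nan, 63.0, np.nan],
     [np.nan, np.nan, np.nan, np.nan, np.nan],
     [5.0, 6.0, 7.0, 8.0, 9.0]],
    index=["a", "b", "c"],
)
result = last_valid_per_row(df)
```

Returning a Series keyed by df.index makes it easy to assign the result back, for example df['status'] = last_valid_per_row(df).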
Use agg('last'):
df.groupby(['status'] * df.shape[1], axis=1).agg('last')
'last' within agg returns the last valid (non-null) value in each group. I passed a list whose length equals the number of columns, with every element set to 'status', so all columns fall into a single group. The result is a DataFrame with one column named 'status'.
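Recent pandas releases deprecate grouping along axis=1 (to my knowledge since 2.1), so on a newer version the same idea can be written by transposing first. A minimal sketch, with a made-up three-row frame standing in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the second row is entirely NaN.
df = pd.DataFrame(
    {1: [7.080, np.nan, 7.053],
     2: [7.079, np.nan, 7.053],
     3: [np.nan, np.nan, 7.052]},
    index=["2016-06-02", "2016-06-14", "2016-06-21"],
)

# One label per original column, all identical, so everything lands in one group.
labels = ["status"] * df.shape[1]

# Transpose so the original columns become rows, group those rows, take the
# last non-null entry per group, then transpose back to a single 'status' column.
out = df.T.groupby(labels).agg("last").T
```

The all-NaN row still comes out as NaN, because 'last' has no valid value to return for it.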
You need last_valid_index inside a custom function, because if all values in a row are NaN it returns None, and indexing the row with None raises a KeyError:
def f(x):
    if x.last_valid_index() is None:
        return np.nan
    else:
        return x[x.last_valid_index()]
df['status'] = df.apply(f, axis=1)
print (df)
                1      2      3      4      5      6      7      8      9  \
0
2016-06-02  7.080  7.079  7.079  7.079  7.079  7.079    NaN    NaN    NaN
2016-06-08  7.053  7.053  7.053  7.053  7.053  7.054    NaN    NaN    NaN
2016-06-09  7.061  7.061  7.060  7.060  7.060  7.060    NaN    NaN    NaN
2016-06-14    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2016-06-15  7.066  7.066  7.066  7.066    NaN    NaN    NaN    NaN    NaN
2016-06-16  7.067  7.067  7.067  7.067  7.067  7.067  7.068  7.068    NaN
2016-06-21  7.053  7.053  7.052    NaN    NaN    NaN    NaN    NaN    NaN
2016-06-22  7.049  7.049    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2016-06-28  7.058  7.058  7.059  7.059  7.059  7.059  7.059  7.059  7.059

            status
0
2016-06-02   7.079
2016-06-08   7.054
2016-06-09   7.060
2016-06-14     NaN
2016-06-15   7.066
2016-06-16   7.068
2016-06-21   7.052
2016-06-22   7.049
2016-06-28   7.059
Alternative solution: forward-fill along each row with ffill(axis=1), then select the last column with iloc:
df['status'] = df.ffill(axis=1).iloc[:, -1]
print (df)
            status
0
2016-06-02   7.079
2016-06-08   7.054
2016-06-09   7.060
2016-06-14     NaN
2016-06-15   7.066
2016-06-16   7.068
2016-06-21   7.052
2016-06-22   7.049
2016-06-28   7.059
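As a quick sanity check, the two approaches from this answer can be run side by side on a small frame. This is a sketch with invented data (including a row with a gap in the middle, which the question does not require); the variable names via_apply and via_ffill are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: trailing NaNs, an all-NaN row, and an interior gap.
df = pd.DataFrame(
    {1: [7.080, np.nan, 7.053, 1.0],
     2: [7.079, np.nan, np.nan, np.nan],
     3: [np.nan, np.nan, np.nan, 3.0]},
)


def f(x):
    # last_valid_index() is None when the whole row is NaN.
    idx = x.last_valid_index()
    return np.nan if idx is None else x.loc[idx]


via_apply = df.apply(f, axis=1)
via_ffill = df.ffill(axis=1).iloc[:, -1]

# Raises AssertionError if the two results differ (NaNs in the same place count as equal).
pd.testing.assert_series_equal(via_apply, via_ffill, check_names=False)
```

The ffill version is shorter but fills the entire frame before discarding all but one column, while the apply version calls a Python function once per row; for a 40 x 500 frame either should be fine.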