I want to know the first year with incoming revenue for various projects.
Given the following dataframe:
ID   Y1   Y2   Y3
0   NaN    8    4
1   NaN  NaN    1
2   NaN  NaN  NaN
3     5    3  NaN
Avoiding apply is preferable because it is not vectorized. The following is vectorized; it was tested with pandas 1.1.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Y1':[np.nan, np.nan, np.nan, 5],'Y2':[8, np.nan, np.nan, 3], 'Y3':[4, 1, np.nan, np.nan]})
# df.dropna(how='all', inplace=True) # Optional but cleaner
# For ranking only:
col_ranks = pd.DataFrame(index=df.columns, data=np.arange(1, 1 + len(df.columns)), columns=['first_notna_rank'], dtype='UInt8') # UInt8 supports max value of 255.
df['first_notna_name'] = df.dropna(how='all').notna().idxmax(axis=1).astype('string')
If df has no rows with all nulls, dropna(how='all') above can be removed.
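To see why the dropna step matters, here is a minimal sketch on the same sample data. The names naive, safe and masked are illustrative only and are not part of the answer above. On an all-False row, idxmax returns the first column label, so an all-null row would otherwise be reported as Y1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Y1': [np.nan, np.nan, np.nan, 5],
    'Y2': [8, np.nan, np.nan, 3],
    'Y3': [4, 1, np.nan, np.nan],
})

# Without dropna: row 2 is all False, so idxmax falls back to the first column.
naive = df.notna().idxmax(axis=1)

# With dropna: row 2 is excluded from the result entirely.
safe = df.dropna(how='all').notna().idxmax(axis=1)

# Masking is an alternative that keeps every row and marks row 2 as missing.
masked = naive.where(df.notna().any(axis=1))

print(naive.tolist())
print(safe.tolist())
```

Assigning safe back to df works because pandas aligns on the index, which leaves the dropped row as missing.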
If df has no rows with all nulls:
# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0.
df['first_notna_value'] = df.lookup(row_labels=df.index, col_labels=df['first_notna_name'])
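On pandas versions without lookup, one possible substitute is NumPy integer indexing. This is only a sketch, not the method tested above: the sample frame and the names first_name, row_pos, col_pos, first_value and first_rank are made up for illustration, and it assumes there are no all-null rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with no all-null rows.
df = pd.DataFrame({
    'Y1': [np.nan, 2.0, 5.0],
    'Y2': [8.0, 3.0, np.nan],
    'Y3': [4.0, 1.0, 7.0],
})

# Label of the first non-null column in each row.
first_name = df.notna().idxmax(axis=1)

# Translate labels into integer positions, then pick one cell per row.
row_pos = np.arange(len(df))
col_pos = df.columns.get_indexer(first_name)
first_value = pd.Series(df.to_numpy()[row_pos, col_pos], index=df.index)

# The 1-based column position doubles as the rank.
first_rank = col_pos + 1

print(first_value.tolist())
```

Note that get_indexer returns -1 for a label it cannot find, which would silently select the last column, so missing labels need to be handled before indexing if all-null rows are possible.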
If df may have rows with all nulls (inefficient):
df['first_notna_value'] = df.drop(columns='first_notna_name').bfill(axis=1).iloc[:, 0]
df = df.merge(col_ranks, how='left', left_on='first_notna_name', right_index=True)
Is there a better way?
    Y1   Y2   Y3 first_notna_name  first_notna_value first_notna_rank
0  NaN  8.0  4.0               Y2                8.0                2
1  NaN  NaN  1.0               Y3                1.0                3
2  NaN  NaN  NaN             <NA>                NaN             <NA>
3  5.0  3.0  NaN               Y1                5.0                1
Partial credit: answers by piRSquared and Andy
Apply this code to a dataframe with only one row to return the first column in that row that contains a non-null value.
row.columns[~row.isna().all()][0]
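As a small illustration of the mask-based approach on a one-row frame: the boolean mask keeps the columns that hold a value, and indexing the result with [0] or [-1] picks the first or last of them. The sample row and the names has_value, non_null_cols, first_col and last_col are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame.
row = pd.DataFrame({'Y1': [np.nan], 'Y2': [8.0], 'Y3': [4.0]})

# True for every column that holds at least one non-null value.
has_value = ~row.isna().all()

# Column labels that survive the mask, in their original order.
non_null_cols = row.columns[has_value.to_numpy()]

# Guard against an all-null row, where indexing would raise IndexError.
first_col = non_null_cols[0] if len(non_null_cols) else None
last_col = non_null_cols[-1] if len(non_null_cols) else None

print(first_col, last_col)
```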
You can apply first_valid_index to each row in the dataframe using a lambda expression with axis=1 to specify rows.
>>> df.apply(lambda row: row.first_valid_index(), axis=1)
ID
0 Y2
1 Y3
2 None
3 Y1
dtype: object
To apply it to your dataframe:
df = df.assign(first = df.apply(lambda row: row.first_valid_index(), axis=1))
>>> df
     Y1   Y2   Y3 first
ID
0   NaN    8    4    Y2
1   NaN  NaN    1    Y3
2   NaN  NaN  NaN  None
3     5    3  NaN    Y1