Get first non-null value per row

后端 未结 4 1740
轻奢々
轻奢々 2020-12-20 21:04

I have a sample dataframe show as below. For each line, I want to check the c1 first, if it is not null, then check c2. By this way, find the first notnull column and store

相关标签:
4条回答
  • 2020-12-20 21:34

    Setup

    df = df.set_index('ID') # if necessary
    df
         c1   c2  c3   c4
    ID                   
    1     a    b   a  NaN
    2   NaN   cc  dd   cc
    3   NaN   ee  ff   ee
    4   NaN  NaN  gg   gg
    

    Solution
    stack + groupby + first
    stack implicitly drops NaNs, so groupby.first is guarantee to give you the first non-null value if it exists. Assigning the result back will expose any NaNs at missing indices which you can fillna with a subsequent call.

    df['result'] = df.stack().groupby(level=0).first()
    # df['result'] = df['result'].fillna('unknown') # if necessary 
    df
         c1   c2  c3   c4 result
    ID                          
    1     a    b   a  NaN      a
    2   NaN   cc  dd   cc     cc
    3   NaN   ee  ff   ee     ee
    4   NaN  NaN  gg   gg     gg
    

    (beware, this is slow for larger dataframes, for performance you may use @jezrael's solution)

    0 讨论(0)
  • 2020-12-20 21:35

    One way is to use pd.DataFrame.lookup with pd.Series.first_valid_index applied on a transposed dataframe:

    df = pd.DataFrame({'ID': [1, 2, 3, 4],
                       'c1': ['a', '', '', ''],
                       'c2': ['b', 'cc', '', ''],
                       'c3': ['' , 'dd', 'ee', ''],
                       'c4': ['', '', 'ff', 'gg']})
    
    df = df.replace('', np.nan)
    
    df['result'] = df.lookup(df.index, df.iloc[:, 1:].T.apply(pd.Series.first_valid_index))
    
    print(df)
    
       ID   c1   c2   c3   c4 result
    0   1    a    b  NaN  NaN      a
    1   2  NaN   cc   dd  NaN     cc
    2   3  NaN  NaN   ee   ff     ee
    3   4  NaN  NaN  NaN   gg     gg
    
    0 讨论(0)
  • 2020-12-20 21:47

    I am using lookup and data from Jpp

    df=df.set_index('ID')
    s=df.ne('').idxmax(1)
    df['Result']=df.lookup(s.index,s)
    df
    Out[492]: 
       c1  c2  c3  c4 Result
    ID                      
    1   a   b              a
    2      cc  dd         cc
    3          ee  ff     ee
    4              gg     gg
    
    0 讨论(0)
  • 2020-12-20 21:54

    Use back filling NaNs first and then select first column by iloc:

    df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
    

    Or:

    df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
    

    print (df)
       ID   c1   c2  c3   c4 result
    0   1    a    b   a  NaN      a
    1   2  NaN   cc  dd   cc     cc
    2   3  NaN   ee  ff   ee     ee
    3   4  NaN  NaN  gg   gg     gg
    

    Performance:

    df = pd.concat([df] * 1000, ignore_index=True)
    
    
    In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
    100 loops, best of 3: 2.78 ms per loop
    
    In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
    100 loops, best of 3: 2.7 ms per loop
    
    #jpp solution
    In [222]: %%timeit
         ...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
         ...: 
         ...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
         ...: 
    1 loop, best of 3: 180 ms per loop
    
    #cᴏʟᴅsᴘᴇᴇᴅ'  s solution
    In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
    1 loop, best of 3: 606 ms per loop
    
    0 讨论(0)
提交回复
热议问题