pandas left join and update existing column

前端未结

关注

 5  1619

I am new to pandas and can\'t seem to get this to work with merge function:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  我在风中等你        
                
              
                            
                2020-12-01 16:19
              
            
            
                                                                       
DataFrame.update() is nice, but it doesn't let you specify columns to join on and more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-nan values in the original DataFrame. To me, this is undesirable behavior.

Here's a custom method I rolled to fix these issues. It's freshly written, so users beware..

join_insertion()

def join_insertion(into_df, from_df, on, cols, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param cols: list of column names (values to insert)
    :param on: list of column names (values to join on), or a dict of {into:from} column name pairs
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
    an error can be raised, or the first matching value can be inserted, or the last matching value
    can be inserted
    :return: a modified copy of into_df, with updated values using from_df
    """

    # Infer left_on, right_on
    if (isinstance(on, dict)):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif(isinstance(on, list)):
        left_on = on
        right_on = on
    elif(isinstance(on, str)):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a list or dictionary")

    # Make cols a list if it isn't already
    if(isinstance(cols, str)):
        cols = [cols]

    # Setup
    A = into_df.copy()
    B = from_df[right_on + cols].copy()

    # Insert row ids
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    A = pd.merge(
        left=A,
        right=B,
        how='left',
        left_on=left_on,
        right_on=right_on,
        suffixes=(None, '_y'),
        indicator=True
    ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Check for rows of A which got duplicated by the merge, and then handle appropriately
    if(mult == 'error'):
        if(A.groupby('_A_RowId_').size().max() > 1):
            raise Exception("At least one key of into_df matched multiple rows of from_df.")
    elif(mult == 'first'):
        A = A.groupby('_A_RowId_').first().reset_index()
    elif(mult == 'last'):
        A = A.groupby('_A_RowId_').last().reset_index()

    mask = A._merge == 'both'
    cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns
    A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)

    return A


Example Use

into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})
   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})
   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
  Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0


As an aside, this is one of those things I severely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  忘了有多久        
                
              
                            
                2020-12-01 16:23
              
            
            
                                                                       
One other way is to use pd.merge like so:

 >>> import pandas as pd

 >>> final = pd.merge(right, left, 
                      how='outer',
                      left_index=True,
                      right_index=True,
                      on=('a', 'c')
                     ).sort_index(axis=1)

 >>> final       
    a  b   c   d 
 0  1  4   7   13.0
 1  2  5   8   14.0
 2  3  6   9   15.0
 3  4  7   12  NaN 


You can compute the intersection of both DataFrames's columns names you want to update to pass it to the 'on=' parameter of the function.

It does not create unwanted columns that have to be dropped like with Zero's solution.

Edit:
The NaN value might change integers to floats in the same column.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  猫巷女王i        
                
              
                            
                2020-12-01 16:26
              
            
            
                                                                       
One way to do this is to set the a column as the index and update:

In [11]: left_a = left.set_index('a')

In [12]: right_a = right.set_index('a')


Note: update only does a left join (not merges), so as well as set_index you also need to include the additional columns not present in left_a.

In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))

In [14]: res.update(right_a)

In [15]: res.reset_index(inplace=True)

In [16]: res
Out[16]:
   a   b   c   d
0  1   4   7  13
1  2   5   8  14
2  3   6   9  15
3  4   7  12 NaN

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  无人及你        
                
              
                            
                2020-12-01 16:29
              
            
            
                                                                       
Here's a way to do it with join:

In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')

In [633]: t
Out[633]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12      NaN NaN


Now, we want to set null values of c_right (which is from the right dataframe) with values from c column from the left dataframe. Updated the below process with a method taking from @John Galt's answer

In [657]: t['c_right'] = t['c_right'].fillna(t['c'])

In [658]: t
Out[658]: 
   b   c  c_right   d
a                    
1  4   9        7  13
2  5  10        8  14
3  6  11        9  15
4  7  12       12 NaN

In [659]: t.drop('c_right', axis=1)
Out[659]: 
   b   c   d
a           
1  4   9  13
2  5  10  14
3  6  11  15
4  7  12 NaN

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  迷失自我        
                
              
                            
                2020-12-01 16:32
              
            
            
                                                                       
You can use merge() between left and right with how='left' on 'a' column.

In [74]: final = left.merge(right, on='a', how='left')

In [75]: final
Out[75]:
   a  b  c_x  c_y   d
0  1  4    9    7  13
1  2  5   10    8  14
2  3  6   11    9  15
3  4  7   12  NaN NaN


Replace NaN value from c_y with c_x value

In [76]: final['c'] = final['c_y'].fillna(final['c_x'])

In [77]: final
Out[77]:
   a  b  c_x  c_y   d   c
0  1  4    9    7  13   7
1  2  5   10    8  14   8
2  3  6   11    9  15   9
3  4  7   12  NaN NaN  12


Drop unwanted columns, and you have the result

In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
   a  b   d   c
0  1  4  13   7
1  2  5  14   8
2  3  6  15   9
3  4  7 NaN  12

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复