pandas left join and update existing column

前端 未结 5 1619
忘掉有多难
忘掉有多难 2020-12-01 16:16

I am new to pandas and can\'t seem to get this to work with merge function:

>>> left       >>> right
   a  b   c       a  c   d 
0  1  4            


        
相关标签:
5条回答
  • 2020-12-01 16:19

    DataFrame.update() is nice, but it doesn't let you specify columns to join on and more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-nan values in the original DataFrame. To me, this is undesirable behavior.

    Here's a custom method I rolled to fix these issues. It's freshly written, so users beware..

    join_insertion()

    def join_insertion(into_df, from_df, on, cols, mult='error'):
        """
        Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
        This function allows you to do an operation like:
        "where A and B match via the column foo, insert the values of baz and buz from B into A"
        Note that this'll update A's values for baz and it'll insert buz as a new column.
        This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!
    
        :param into_df: dataframe you want to modify
        :param from_df: dataframe with the values you want to insert
        :param cols: list of column names (values to insert)
        :param on: list of column names (values to join on), or a dict of {into:from} column name pairs
        :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
        an error can be raised, or the first matching value can be inserted, or the last matching value
        can be inserted
        :return: a modified copy of into_df, with updated values using from_df
        """
    
        # Infer left_on, right_on
        if (isinstance(on, dict)):
            left_on = list(on.keys())
            right_on = list(on.values())
        elif(isinstance(on, list)):
            left_on = on
            right_on = on
        elif(isinstance(on, str)):
            left_on = [on]
            right_on = [on]
        else:
            raise Exception("on should be a list or dictionary")
    
        # Make cols a list if it isn't already
        if(isinstance(cols, str)):
            cols = [cols]
    
        # Setup
        A = into_df.copy()
        B = from_df[right_on + cols].copy()
    
        # Insert row ids
        A['_A_RowId_'] = np.arange(A.shape[0])
        B['_B_RowId_'] = np.arange(B.shape[0])
    
        A = pd.merge(
            left=A,
            right=B,
            how='left',
            left_on=left_on,
            right_on=right_on,
            suffixes=(None, '_y'),
            indicator=True
        ).sort_values(['_A_RowId_', '_B_RowId_'])
    
        # Check for rows of A which got duplicated by the merge, and then handle appropriately
        if(mult == 'error'):
            if(A.groupby('_A_RowId_').size().max() > 1):
                raise Exception("At least one key of into_df matched multiple rows of from_df.")
        elif(mult == 'first'):
            A = A.groupby('_A_RowId_').first().reset_index()
        elif(mult == 'last'):
            A = A.groupby('_A_RowId_').last().reset_index()
    
        mask = A._merge == 'both'
        cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
        for col in cols_in_both:
            A.loc[mask, col] = A.loc[mask, col + '_y']
    
        # Drop unwanted columns
        A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)
    
        return A
    

    Example Use

    into_df = pd.DataFrame({
        'foo': [1, 2, 3],
        'bar': [4, 5, 6],
        'baz': [7, 8, 9]
    })
       foo  bar  baz
    0    1    4    7
    1    2    5    8
    2    3    6    9
    
    from_df = pd.DataFrame({
        'foo': [1, 3, 5, 7, 3],
        'baz': [70, 80, 90, 30, 40],
        'buz': [0, 1, 2, 3, 4]
    })
       foo  baz  buz
    0    1   70    0
    1    3   80    1
    2    5   90    2
    3    7   30    3
    4    3   40    4
    
    # Use it!
    
    join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
      Exception: At least one key of into_df matched multiple rows of from_df.
    
    join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
       foo  bar   baz  buz
    0    1    4  70.0  0.0
    1    2    5   8.0  NaN
    2    3    6  80.0  1.0
    
    join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
       foo  bar   baz  buz
    0    1    4  70.0  0.0
    1    2    5   8.0  NaN
    2    3    6  40.0  4.0
    

    As an aside, this is one of those things I severely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]

    0 讨论(0)
  • 2020-12-01 16:23

    One other way is to use pd.merge like so:

     >>> import pandas as pd
    
     >>> final = pd.merge(right, left, 
                          how='outer',
                          left_index=True,
                          right_index=True,
                          on=('a', 'c')
                         ).sort_index(axis=1)
    
     >>> final       
        a  b   c   d 
     0  1  4   7   13.0
     1  2  5   8   14.0
     2  3  6   9   15.0
     3  4  7   12  NaN 
    

    You can compute the intersection of both DataFrames's columns names you want to update to pass it to the 'on=' parameter of the function.

    It does not create unwanted columns that have to be dropped like with Zero's solution.

    Edit: The NaN value might change integers to floats in the same column.

    0 讨论(0)
  • 2020-12-01 16:26

    One way to do this is to set the a column as the index and update:

    In [11]: left_a = left.set_index('a')
    
    In [12]: right_a = right.set_index('a')
    

    Note: update only does a left join (not merges), so as well as set_index you also need to include the additional columns not present in left_a.

    In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
    
    In [14]: res.update(right_a)
    
    In [15]: res.reset_index(inplace=True)
    
    In [16]: res
    Out[16]:
       a   b   c   d
    0  1   4   7  13
    1  2   5   8  14
    2  3   6   9  15
    3  4   7  12 NaN
    
    0 讨论(0)
  • 2020-12-01 16:29

    Here's a way to do it with join:

    In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
    
    In [633]: t
    Out[633]: 
       b   c  c_right   d
    a                    
    1  4   9        7  13
    2  5  10        8  14
    3  6  11        9  15
    4  7  12      NaN NaN
    

    Now, we want to set null values of c_right (which is from the right dataframe) with values from c column from the left dataframe. Updated the below process with a method taking from @John Galt's answer

    In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
    
    In [658]: t
    Out[658]: 
       b   c  c_right   d
    a                    
    1  4   9        7  13
    2  5  10        8  14
    3  6  11        9  15
    4  7  12       12 NaN
    
    In [659]: t.drop('c_right', axis=1)
    Out[659]: 
       b   c   d
    a           
    1  4   9  13
    2  5  10  14
    3  6  11  15
    4  7  12 NaN
    
    0 讨论(0)
  • 2020-12-01 16:32

    You can use merge() between left and right with how='left' on 'a' column.

    In [74]: final = left.merge(right, on='a', how='left')
    
    In [75]: final
    Out[75]:
       a  b  c_x  c_y   d
    0  1  4    9    7  13
    1  2  5   10    8  14
    2  3  6   11    9  15
    3  4  7   12  NaN NaN
    

    Replace NaN value from c_y with c_x value

    In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
    
    In [77]: final
    Out[77]:
       a  b  c_x  c_y   d   c
    0  1  4    9    7  13   7
    1  2  5   10    8  14   8
    2  3  6   11    9  15   9
    3  4  7   12  NaN NaN  12
    

    Drop unwanted columns, and you have the result

    In [79]: final.drop(['c_x', 'c_y'], axis=1)
    Out[79]:
       a  b   d   c
    0  1  4  13   7
    1  2  5  14   8
    2  3  6  15   9
    3  4  7 NaN  12
    
    0 讨论(0)
提交回复
热议问题