Anti-Join Pandas

前端 未结 6 1061
梦谈多话
梦谈多话 2020-12-25 10:36

I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are u

相关标签:
6条回答
  • 2020-12-25 11:09

    indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:

    • left_only
    • right_only
    • both

    Keep right_only and left_only. That is it.

    outer_join = TableA.merge(TableB, how = 'outer', indicator = True)
    
    anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
    
    
    

    easy!

    Here is a comparison with a solution from piRSquared:

    1) When run on this example matching based on one column, piRSquared's solution is faster.

    2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.

    So it's up for you to decide.

    0 讨论(0)
  • 2020-12-25 11:13

    You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

    Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB's values don't match sA's. The following code solves this exercise:

    import pandas as pd
    
    TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
    TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])
    
    removeTheseIndexes = []
    keyColumnA = TableA.iloc[:,1] # your 'Key' column here
    keyColumnB = TableB.iloc[:,1] # same
    
    for i in range(0, len(keyColumnA)):
        firstValue = keyColumnA[i]
        for j in range(0, len(keyColumnB)):
            copycat = keyColumnB[j]
            if firstValue == copycat:
                removeTheseIndexes.append(j)
    
    TableB.drop(removeTheseIndexes, inplace = True)
    TableA = TableA.append(TableB)
    TableA = TableA.reset_index(drop=True)
    

    Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.

    # Table A
       0  1   2
    0  2  3   4
    1  5  6   7
    2  8  9  10
    
    # Table B
       0   1  2
    0  1   3  4
    1  5   7  8
    2  9  10  0
    
    # Set 'Key' column = 1
    # Run the script after the loop
    
    # Table A
       0   1   2
    0  2   3   4
    1  5   6   7
    2  8   9  10
    3  5   7   8
    4  9  10   0
    
    # Table B
       0   1  2
    1  5   7  8
    2  9  10  0
    
    0 讨论(0)
  • 2020-12-25 11:16

    Easiest answer imaginable:

    tableB = pd.concat([tableB, pd.Series(1)], axis=1)
    mergedTable = tableA.merge(tableB, how="left" on="key")
    
    answer = mergedTable[mergedTable.iloc[:,-1].isnull()][tableA.columns.tolist()]
    

    Should be the fastest proposed as well.

    0 讨论(0)
  • 2020-12-25 11:18

    I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

    import pandas as pd
    import numpy as np
    
    TableA = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('abcd'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    TableB = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('aecf'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    
    print('TableA', TableA, sep='\n')
    print('TableB', TableB, sep='\n')
    
    TableB_only = pd.merge(
        TableA, TableB,
        how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
            '_merge == "right_only"')
    
    print('TableB_only', TableB_only, sep='\n')
    
    Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
    
    print('Table_concatenated', Table_concatenated, sep='\n')
    

    Which prints this output:

    TableA
      Key         A         B         C
    0   a  0.035548  0.344711  0.860918
    1   b  0.640194  0.212250  0.277359
    2   c  0.592234  0.113492  0.037444
    3   d  0.112271  0.205245  0.227157
    TableB
      Key         A         B         C
    0   a  0.754538  0.692902  0.537704
    1   e  0.499092  0.864145  0.004559
    2   c  0.082087  0.682573  0.421654
    3   f  0.768914  0.281617  0.924693
    TableB_only
      Key  A_foo  B_foo  C_foo         A         B         C      _merge
    4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only
    5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_only
    Table_concatenated
      Key         A         B         C
    0   a  0.035548  0.344711  0.860918
    1   b  0.640194  0.212250  0.277359
    2   c  0.592234  0.113492  0.037444
    3   d  0.112271  0.205245  0.227157
    4   e  0.499092  0.864145  0.004559
    5   f  0.768914  0.281617  0.924693
    
    0 讨论(0)
  • 2020-12-25 11:29

    Consider the following dataframes

    TableA = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('abcd'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    TableB = pd.DataFrame(np.random.rand(4, 3),
                          pd.Index(list('aecf'), name='Key'),
                          ['A', 'B', 'C']).reset_index()
    

    TableA
    


    TableB
    

    This is one way to do what you want

    Method 1

    # Identify what values are in TableB and not in TableA
    key_diff = set(TableB.Key).difference(TableA.Key)
    where_diff = TableB.Key.isin(key_diff)
    
    # Slice TableB accordingly and append to TableA
    TableA.append(TableB[where_diff], ignore_index=True)
    

    Method 2

    rows = []
    for i, row in TableB.iterrows():
        if row.Key not in TableA.Key.values:
            rows.append(row)
    
    pd.concat([TableA.T] + rows, axis=1).T
    

    Timing

    4 rows with 2 overlap

    Method 1 is much quicker

    10,000 rows 5,000 overlap

    loops are bad

    0 讨论(0)
  • 2020-12-25 11:30

    Based on one of the other suggestions, here's a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()] you have a semi_join.

    def anti_join(tableA,tableB,on):
    
       #if joining on index, make it into a column
       if tableB.index.name is not None:
           dummy = tableB.reset_index()[on]
       else:
           dummy = tableB[on]
    
       #create a dummy columns of 1s
       if isinstance(dummy, pd.Series):
           dummy = dummy.to_frame()
    
       dummy.loc[:,'dummy_col'] = 1
    
       #preserve the index of tableA if it has one
       if tableA.index.name is not None:
           idx_name = tableA.index.name
           tableA = tableA.reset_index(drop = False)
       else:
           idx_name = None
    
       #do a left-join
       merged = tableA.merge(dummy,on=on,how='left')
    
       #keep only the non-matches
       output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]
    
       #reset the index (if applicable)
       if idx_name is not None:
           output = output.set_index(idx_name)
    
       return(output)
    
    0 讨论(0)
提交回复
热议问题