Anti-Join Pandas

前端未结

关注

 6  1061

I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are u

相关标签:

6条回答

再見小時候

2020-12-25 11:09
indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:
- left_only
- right_only
- both
Keep right_only and left_only. That is it.
```
outer_join = TableA.merge(TableB, how = 'outer', indicator = True)

anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
```
easy!

Here is a comparison with a solution from piRSquared:

1) When run on this example matching based on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.

So it's up for you to decide.
0 讨论(0)
发布评论:

提交评论
- 加载中...

孤街浪徒

2020-12-25 11:13

You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB's values don't match sA's. The following code solves this exercise:

import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same

for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace = True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)

Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.

# Table A
   0  1   2
0  2  3   4
1  5  6   7
2  8  9  10

# Table B
   0   1  2
0  1   3  4
1  5   7  8
2  9  10  0

# Set 'Key' column = 1
# Run the script after the loop

# Table A
   0   1   2
0  2   3   4
1  5   6   7
2  8   9  10
3  5   7   8
4  9  10   0

# Table B
   0   1  2
1  5   7  8
2  9  10  0

0 讨论(0)

甜味超标

2020-12-25 11:16

Easiest answer imaginable:

tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableB, how="left" on="key")

answer = mergedTable[mergedTable.iloc[:,-1].isnull()][tableA.columns.tolist()]

Should be the fastest proposed as well.

0 讨论(0)

南旧

2020-12-25 11:18

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pd
import numpy as np

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')

TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
        '_merge == "right_only"')

print('TableB_only', TableB_only, sep='\n')

Table_concatenated = pd.concat((TableA, TableB_only), join='inner')

print('Table_concatenated', Table_concatenated, sep='\n')

Which prints this output:

TableA
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
TableB
  Key         A         B         C
0   a  0.754538  0.692902  0.537704
1   e  0.499092  0.864145  0.004559
2   c  0.082087  0.682573  0.421654
3   f  0.768914  0.281617  0.924693
TableB_only
  Key  A_foo  B_foo  C_foo         A         B         C      _merge
4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only
5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_only
Table_concatenated
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
4   e  0.499092  0.864145  0.004559
5   f  0.768914  0.281617  0.924693

0 讨论(0)

不思量自难忘°

2020-12-25 11:29

Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

TableA

TableB

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)

Method 2

rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlap

Method 1 is much quicker

10,000 rows 5,000 overlap

loops are bad

0 讨论(0)

故里飘歌

2020-12-25 11:30

Based on one of the other suggestions, here's a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()] you have a semi_join.

def anti_join(tableA,tableB,on):

   #if joining on index, make it into a column
   if tableB.index.name is not None:
       dummy = tableB.reset_index()[on]
   else:
       dummy = tableB[on]

   #create a dummy columns of 1s
   if isinstance(dummy, pd.Series):
       dummy = dummy.to_frame()

   dummy.loc[:,'dummy_col'] = 1

   #preserve the index of tableA if it has one
   if tableA.index.name is not None:
       idx_name = tableA.index.name
       tableA = tableA.reset_index(drop = False)
   else:
       idx_name = None

   #do a left-join
   merged = tableA.merge(dummy,on=on,how='left')

   #keep only the non-matches
   output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]

   #reset the index (if applicable)
   if idx_name is not None:
       output = output.set_index(idx_name)

   return(output)

0 讨论(0)