Python Pandas - Find difference between two data frames


I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

In other w

10 answers

不思量自难忘°

2020-11-22 14:32
Edit 2: I found a new solution that does not require setting the index:
```
newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
```
It turns out the highest-voted answer already contains this. Note that this code only works if neither of the two DataFrames contains duplicate rows of its own.

I also have a trickier method. First, set 'Name' as the index of both DataFrames from the question. Since both contain the same 'Name' values, we can simply drop the index labels of the smaller DataFrame from the bigger one. Here is the code:
```
# Note: inplace=True modifies df1 and df2 themselves
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
# Drop every row of df1 whose Name also appears in df2
newdf = df1.drop(df2.index)
```
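For anyone who wants to try the index-based approach without touching the original frames, here is a minimal sketch. The sample data is made up for illustration (the question's real frames are not shown in this thread), and it assumes every 'Name' in df2 also exists in df1, as the question states.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df1 = pd.DataFrame({"Name": ["John", "Mike", "Smith", "Wale"], "Age": [23, 45, 12, 34]})
df2 = pd.DataFrame({"Name": ["Mike", "Wale"], "Age": [45, 34]})

# set_index without inplace=True returns new frames, so df1 and df2 are left unchanged.
left = df1.set_index("Name")
right = df2.set_index("Name")

# Drop every row of df1 whose Name appears in df2, then turn Name back into a column.
newdf = left.drop(right.index).reset_index()
print(newdf)
```

Because `drop` raises a KeyError for labels it cannot find, this variant relies on df2 really being a subset of df1.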
無奈伤痛

2020-11-22 14:32
Perhaps a simpler one-liner, which works whether the key columns have identical or different names. It worked even when df2['Name2'] contained duplicate values.
```
# The parentheses are required for a method chain that spans several lines
newDf = (
    df1.set_index('Name1')
       .drop(df2['Name2'], errors='ignore')
       .reset_index(drop=False)
)
```
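A quick way to check the claim about duplicates is a small experiment like the one below. The data is invented for illustration: df2 repeats one name and also contains a name that df1 does not have, which is the case `errors='ignore'` is there for.

```python
import pandas as pd

# Hypothetical data: the key columns have different names in the two frames.
df1 = pd.DataFrame({"Name1": ["a", "b", "c", "d"], "Value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"Name2": ["b", "b", "d", "z"]})  # "b" is repeated, "z" is not in df1

new_df = (
    df1.set_index("Name1")
       .drop(df2["Name2"], errors="ignore")  # labels missing from df1 are skipped
       .reset_index()
)
print(new_df)
```

Only the rows for "a" and "c" should remain.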
春和景丽

2020-11-22 14:38
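A slight variation of @liangli's nice solution that does not require changing the index of the existing DataFrames:
```
# Inner-join df1 to df2 on Name to find the matching rows of df1, then drop them by index label.
# rsuffix avoids a name clash when both frames share other column names.
newdf = df1.drop(df1.join(df2.set_index('Name'), on='Name', how='inner', rsuffix='_df2').index)
```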
Happy的楠姐

2020-11-22 14:43
For rows, try this, where Name is the join key column (it can be a list for multiple common columns, or you can specify left_on and right_on instead):
```
m = df1.merge(df2, on='Name', how='outer', suffixes=('', '_'), indicator=True)
```
Setting indicator=True is useful because it adds a column called _merge that classifies every row of the result as one of three kinds: "left_only", "right_only" or "both".
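To get from the merged frame to the actual difference, you would typically filter on `_merge`. A minimal sketch, with made-up sample data and an illustrative variable name `only_in_df1`:

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame({"Name": ["John", "Mike", "Smith"], "Age": [23, 45, 12]})
df2 = pd.DataFrame({"Name": ["Mike"], "Age": [45]})

m = df1.merge(df2, on="Name", how="outer", suffixes=("", "_"), indicator=True)

# Keep rows that exist only in df1, and only df1's original columns.
only_in_df1 = m.loc[m["_merge"] == "left_only", df1.columns]
print(only_in_df1)
```

Filtering on "right_only" instead would give the rows that exist only in df2.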

For columns, try this:
```
set(df1.columns).symmetric_difference(df2.columns)
```
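A small sketch of the column comparison, using invented column names. The set version gives columns that are in exactly one frame; `Index.difference` is shown as a one-directional alternative.

```python
import pandas as pd

# Hypothetical frames that differ in one column each.
df1 = pd.DataFrame({"Name": ["a"], "Age": [1], "City": ["x"]})
df2 = pd.DataFrame({"Name": ["a"], "Age": [1], "Email": ["e"]})

# Columns present in exactly one of the two frames.
col_diff = set(df1.columns).symmetric_difference(df2.columns)

# Columns present in df1 but not in df2.
only_df1 = df1.columns.difference(df2.columns)

print(col_diff, list(only_df1))
```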

忘了有多久

2020-11-22 14:45

By using drop_duplicates:
```
pd.concat([df1,df2]).drop_duplicates(keep=False)
```

Update:

The above method only works for DataFrames that do not contain duplicate rows themselves. For example:
```
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
```
Here it produces the output below, which is wrong because the repeated row (3, 4) of df1 is dropped as well.

Wrong output:
```
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3
```

The correct output should keep rows 1, 2 and 3 of df1. How can we achieve that?

Method 1: Using isin with tuple
```
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4
```
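If the row-wise `apply(tuple, 1)` gets slow on large frames, one possible alternative (not from this answer, just a sketch) is to build the row keys with `pd.MultiIndex.from_frame` and use its `isin`. This assumes both frames have the same columns in the same order.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# Each row becomes one tuple-like key in a MultiIndex.
left_keys = pd.MultiIndex.from_frame(df1)
right_keys = pd.MultiIndex.from_frame(df2)

# Boolean array: True where a row of df1 also occurs in df2.
mask = left_keys.isin(right_keys)

result = df1[~mask]
print(result)
```

Like Method 1, this keeps df1's original index and its repeated rows.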

Method 2: merge with indicator
```
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
```
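Method 2 can be wrapped in a small helper if you need it in several places. The function name `rows_only_in_left` is made up for this sketch. It also de-duplicates the right frame first, on the assumption that you do not want repeated rows in df2 to multiply the matching rows of df1, and it drops the `_merge` column at the end.

```python
import pandas as pd

def rows_only_in_left(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows of `left` that have no exact match in `right`.

    Joins on all columns the two frames have in common (the merge default).
    """
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

diff = rows_only_in_left(df1, df2)
print(diff)
```

Keep in mind that merge builds a fresh 0..n-1 index, so the labels in the result are positions in the merged frame rather than df1's original index labels.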


长发绾君心

2020-11-22 14:45
As mentioned in another answer here,
```
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
```
is a correct solution, but it will produce the wrong output if the smaller DataFrame comes first:
```
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
```
In that case the above solution returns an empty DataFrame, because it only keeps rows of df1 that are missing from df2. Instead, use the concat method after removing duplicates from each DataFrame.

Use concat with drop_duplicates:
```
df1=df1.drop_duplicates(keep="first") 
df2=df2.drop_duplicates(keep="first") 
pd.concat([df1,df2]).drop_duplicates(keep=False)
```
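To see the two behaviours side by side, here is a small sketch using the same frames as above. The variable names `one_way` and `sym_diff` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1], "B": [2]})
df2 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})

# One-directional difference: rows of df1 that are not in df2 (empty here).
one_way = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]

# Symmetric difference: de-duplicate each frame first, then drop every row
# that still appears more than once after concatenation.
sym_diff = pd.concat(
    [df1.drop_duplicates(), df2.drop_duplicates()]
).drop_duplicates(keep=False)

print(len(one_way))
print(sym_diff)
```

Note that de-duplicating first means the repeated row (3, 4) of df2 shows up only once in the result.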
