I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
In other w
edit2, I figured out a new solution without the need of setting index
newdf=pd.concat[df1,df2].drop_duplicates(keep=False)
okay i found the answer of hightest vote already contain what i have figured out .Yes, we can only use this code on condition that there are no duplicates in each two dfs.
I have a tricky method.First we set ’Name’ as the index of two dataframe given by the question.Since we have same ’Name’ in two dfs,we can just drop the ’smaller’ df’s index from the ‘bigger’ df. Here is the code.
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)
Perhaps a simpler one-liner, with identical or different column names. Worked even when df2['Name2'] contained duplicate values.
newDf = df1.set_index('Name1')
.drop(df2['Name2'], errors='ignore')
.reset_index(drop=False)
A slight variation of the nice @liangli's solution that does not require to change the index of existing dataframes:
newdf = df1.drop(df1.join(df2.set_index('Name').index))
For rows, try this, where Name
is the joint index column (can be a list for multiple common columns, or specify left_on
and right_on
):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True
setting is useful as it adds a column called _merge
, with all changes between df1
and df2
, categorized into 3 possible kinds: "left_only", "right_only" or "both".
For columns, try this:
set(df1.columns).symmetric_difference(df2.columns)
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update :
Above method only working for those dataframes they do not have duplicate itself, For example
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It will output like below , which is wrong
Wrong Output :
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin
with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge
with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
As mentioned here that
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
is correct solution but it will produce wrong output if
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
In that case above solution will give
Empty DataFrame, instead you should use concat
method after removing duplicates from each datframe.
Use concate with drop_duplicates
df1=df1.drop_duplicates(keep="first")
df2=df2.drop_duplicates(keep="first")
pd.concat([df1,df2]).drop_duplicates(keep=False)