Python Pandas - Find difference between two data frames

陌清茗 2020-11-22 13:59

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

In other w

10 Answers
  • 2020-11-22 14:32

    Edit 2: I found a new solution that does not require setting the index.

    newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
    

    It turns out the highest-voted answer already covers this. Note that this code only works if neither DataFrame contains duplicate rows of its own.


    Here is a trickier method. First, set 'Name' as the index of both DataFrames from the question. Since the two DataFrames share the same 'Name' values, we can simply drop the smaller DataFrame's index labels from the bigger one. Here is the code:

    df1.set_index('Name',inplace=True)
    df2.set_index('Name',inplace=True)
    newdf=df1.drop(df2.index)
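
As a rough illustration of the steps above, here is a minimal sketch on made-up data. The sample values and the 'Age' column are assumptions for the example, since the question's actual frames are not shown here.

```python
import pandas as pd

# Hypothetical sample data; the question does not show the real frames.
df1 = pd.DataFrame({"Name": ["John", "Mike", "Smith", "Wale"],
                    "Age": [23, 45, 12, 34]})
df2 = pd.DataFrame({"Name": ["John", "Smith"],
                    "Age": [23, 12]})

# Same steps as above: index both frames by Name, then drop df2's labels.
df1.set_index("Name", inplace=True)
df2.set_index("Name", inplace=True)
newdf = df1.drop(df2.index)

print(newdf)  # only the rows for Mike and Wale remain
```

Note that rows are matched on 'Name' alone, so a row in df1 is dropped whenever its name appears in df2, even if the other columns differ.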
    
  • 2020-11-22 14:32

    Perhaps a simpler one-liner that works whether the key columns have the same name or different names. It also worked when df2['Name2'] contained duplicate values.

    # Parentheses are needed so the chained calls can span several lines
    newDf = (df1.set_index('Name1')
                .drop(df2['Name2'], errors='ignore')
                .reset_index(drop=False))
    
  • 2020-11-22 14:38

    A slight variation of @liangli's nice solution that does not require changing the index of the existing DataFrames:

    newdf = df1.set_index('Name').drop(df2.set_index('Name').index).reset_index()
    
  • 2020-11-22 14:43

    For rows, try this, where Name is the column to join on (it can be a list for multiple common columns, or you can specify left_on and right_on instead):

    m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
    

    The indicator=True setting is useful because it adds a column called _merge that labels every row by where it was found, using one of three values: "left_only", "right_only" or "both".
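
To turn the merged frame into an actual difference, one option is to filter on the _merge column. The sketch below assumes made-up sample data and an 'Age' column, neither of which comes from the question.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df1 = pd.DataFrame({"Name": ["John", "Mike", "Smith"], "Age": [23, 45, 12]})
df2 = pd.DataFrame({"Name": ["John", "Smith"], "Age": [23, 12]})

m = df1.merge(df2, on="Name", how="outer", suffixes=("", "_"), indicator=True)

# Keep rows whose Name exists only in df1, restricted to df1's own columns.
only_in_df1 = m.loc[m["_merge"] == "left_only", df1.columns]
print(only_in_df1)
```

Filtering on "right_only" instead would give the rows whose key exists only in df2.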

    For columns, try this:

    set(df1.columns).symmetric_difference(df2.columns)
    
  • 2020-11-22 14:45

    By using drop_duplicates

    pd.concat([df1,df2]).drop_duplicates(keep=False)
    

    Update:

    The method above only works when neither DataFrame contains duplicate rows of its own. For example:

    df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
    df2=pd.DataFrame({'A':[1],'B':[2]})
    

    It produces the output below, which is wrong because the duplicated row (3, 4) in df1 is dropped as well.

    Wrong output:

    pd.concat([df1, df2]).drop_duplicates(keep=False)
    Out[655]: 
       A  B
    1  2  3
    

    Correct output:

    Out[656]: 
       A  B
    1  2  3
    2  3  4
    3  3  4
    

    How to achieve that?

    Method 1: Using isin with tuple

    df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
    Out[657]: 
       A  B
    1  2  3
    2  3  4
    3  3  4
    

    Method 2: merge with indicator

    df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
    Out[421]: 
       A  B     _merge
    1  2  3  left_only
    2  3  4  left_only
    3  3  4  left_only
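
For reference, here is a self-contained sketch that runs both methods on the example frames above. Dropping the helper _merge column at the end is an addition for the example, so that both results have the same columns as df1.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# Method 1: turn each row into a tuple and keep df1 rows not present in df2.
by_isin = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]

# Method 2: left merge with an indicator, keep rows found only in df1,
# then drop the helper column.
by_merge = (
    df1.merge(df2, how='left', indicator=True)
       .loc[lambda x: x['_merge'] == 'left_only']
       .drop(columns='_merge')
)

print(by_isin)
print(by_merge)
```

Both variants keep the duplicated (3, 4) rows of df1, which is what the concat approach loses.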
    
  • 2020-11-22 14:45

    As mentioned in another answer,

    df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
    

    is a correct solution, but it will not give the expected output if

    df1=pd.DataFrame({'A':[1],'B':[2]})
    df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
    

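    In that case the solution above returns an empty DataFrame, because it only keeps rows of df1 that are missing from df2. To get the rows that differ in either direction, use the concat method after removing duplicates from each DataFrame.

    Use concat with drop_duplicates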

    df1=df1.drop_duplicates(keep="first") 
    df2=df2.drop_duplicates(keep="first") 
    pd.concat([df1,df2]).drop_duplicates(keep=False)
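
The same steps can be wrapped in a small helper and checked against the example frames. The function name symmetric_difference is a hypothetical choice for this sketch, not a pandas API.

```python
import pandas as pd

def symmetric_difference(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return rows that appear in exactly one of the two frames (hypothetical helper)."""
    # De-duplicate each frame first so that repeats inside one frame
    # are not mistaken for matches between the two frames.
    combined = pd.concat([left.drop_duplicates(), right.drop_duplicates()])
    # keep=False drops every row that occurs in both frames.
    return combined.drop_duplicates(keep=False)

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})

result = symmetric_difference(df1, df2)
print(result)
```

With this approach duplicated rows are collapsed, so the (3, 4) row appears once in the result rather than twice.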
    