Python Pandas - Find difference between two data frames

陌清茗 2020-11-22 13:59

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

In other w

10 answers
  • 2020-11-22 14:49
    import pandas as pd
    # given
    df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
        'Age':[23,45,12,34,27,44,28,39,40]})
    df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
        'Age':[23,12,34,44,28,40]})
    
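    # find rows of df1 that are not in df2
    # caveat: the columns are tested independently, so a df1 row is also dropped
    # when its Name and its Age both occur in df2 but in different rows
    df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)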
    
    # output:
    print('df1\n', df1)
    print('df2\n', df2)
    print('df_1notin2\n', df_1notin2)
    
    # df1
    #     Age   Name
    # 0   23   John
    # 1   45   Mike
    # 2   12  Smith
    # 3   34   Wale
    # 4   27  Marry
    # 5   44    Tom
    # 6   28  Menda
    # 7   39   Bolt
    # 8   40  Yuswa
    # df2
    #     Age   Name
    # 0   23   John
    # 1   12  Smith
    # 2   34   Wale
    # 3   44    Tom
    # 4   28  Menda
    # 5   40  Yuswa
    # df_1notin2
    #     Age   Name
    # 0   45   Mike
    # 1   27  Marry
    # 2   39   Bolt
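
A row-wise alternative, sketched here as one possible option rather than as part of the answer above: a left merge with `indicator=True` compares whole rows, so it avoids the independent-column caveat. The small frames and the name `rows_only_in_df1` are made up for illustration; the row `('John', 12)` is chosen because its Name and Age both occur in df2, but in different rows.

```python
import pandas as pd

# Illustrative data: ('John', 12) is NOT a row of df2, although
# 'John' and 12 each appear somewhere in df2.
df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith', 'John'],
                    'Age': [23, 45, 12, 12]})
df2 = pd.DataFrame({'Name': ['John', 'Smith'],
                    'Age': [23, 12]})

# Left-join on all shared columns; the `_merge` column records whether
# each df1 row was found in both frames or only in df1.
merged = df1.merge(df2.drop_duplicates(), how='left', indicator=True)

# Keep the rows found only in df1 and drop the helper column.
rows_only_in_df1 = (merged[merged['_merge'] == 'left_only']
                    .drop(columns='_merge')
                    .reset_index(drop=True))
print(rows_only_in_df1)

# For comparison, the column-wise isin test from the answer above.
isin_result = df1[~(df1['Name'].isin(df2['Name'])
                    & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
print(isin_result)
```

On this data the merge keeps both `('Mike', 45)` and `('John', 12)`, while the column-wise test keeps only `('Mike', 45)`.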
    
  • 2020-11-22 14:50

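    Finding the difference by index. This assumes df2 is a subset of df1 and that the index labels are carried forward when subsetting. The labels are passed to .loc as a sorted list because recent pandas versions no longer accept a set as an indexer.

    df1.loc[sorted(set(df1.index).symmetric_difference(set(df2.index)))].dropna()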
    
    # Example
    
    import numpy as np
    import pandas as pd
    
    df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])
    
    df2 =  df1.loc[[1,3,5]]
    
    df1
    
      gender subject
    1      f     bio
    2      m    chem
    3      f     phy
    4      m     bio
    5      f     bio
    
    df2
    
      gender subject
    1      f     bio
    3      f     phy
    5      f     bio
    
    df3 = df1.loc[sorted(set(df1.index).symmetric_difference(set(df2.index)))].dropna()
    
    df3
    
      gender subject
    2      m    chem
    4      m     bio
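
When df2 really is a subset of df1, the same result can probably be had without building Python sets. The following is a minimal sketch using `Index.difference`, with fixed (non-random) data made up so the result is reproducible:

```python
import pandas as pd

# Fixed illustrative data so the outcome does not depend on random draws.
df1 = pd.DataFrame({"gender": ["f", "m", "f", "m", "f"],
                    "subject": ["bio", "chem", "phy", "bio", "bio"]},
                   index=[1, 2, 3, 4, 5])
df2 = df1.loc[[1, 3, 5]]

# Index.difference returns the labels of df1 that are absent from df2.
df3 = df1.loc[df1.index.difference(df2.index)]
print(df3)
```

Note that this compares index labels only; rows with equal labels but different values are not detected.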
    
    
  • 2020-11-22 14:51

    In addition to the accepted answer, I would like to propose a more general solution that finds the 2D set difference of two data frames with any index/columns (they need not coincide between the two data frames). The method also lets you set a tolerance for float elements when comparing the data frames (it uses np.isclose).

    
    import numpy as np
    import pandas as pd
    
    def get_dataframe_setdiff2d(df_new: pd.DataFrame, 
                                df_old: pd.DataFrame, 
                                rtol=1e-03, atol=1e-05) -> pd.DataFrame:
        """Returns set difference of two pandas DataFrames"""
    
        union_index = np.union1d(df_new.index, df_old.index)
        union_columns = np.union1d(df_new.columns, df_old.columns)
    
        new = df_new.reindex(index=union_index, columns=union_columns)
        old = df_old.reindex(index=union_index, columns=union_columns)
    
        mask_diff = ~np.isclose(new, old, rtol, atol)
    
        df_bool = pd.DataFrame(mask_diff, union_index, union_columns)
    
        df_diff = pd.concat([new[df_bool].stack(),
                             old[df_bool].stack()], axis=1)
    
        df_diff.columns = ["New", "Old"]
    
        return df_diff
    

    Example:

    In [1]
    
    df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
    df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})
    
    print("df1:\n", df1, "\n")
    
    print("df2:\n", df2, "\n")
    
    diff = get_dataframe_setdiff2d(df1, df2)
    
    print("diff:\n", diff, "\n")
    
    Out [1]
    
    df1:
       A  C
    0  2  2
    1  1  1
    2  2  2 
    
    df2:
       A  B
    0  1  1
    1  1  1 
    
    diff:
         New  Old
    0 A  2.0  1.0
      B  NaN  1.0
      C  2.0  NaN
    1 B  NaN  1.0
      C  1.0  NaN
    2 A  2.0  NaN
      C  2.0  NaN 
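
To make the tolerance parameters concrete, here is a small sketch of how `np.isclose` behaves with the defaults used in the function above (`rtol=1e-03`, `atol=1e-05`). It treats two values as equal when `|new - old| <= atol + rtol * |old|`; the arrays are made-up illustration values.

```python
import numpy as np

new = np.array([1.0005, 1.01, np.nan])
old = np.array([1.0, 1.0, np.nan])

# Threshold for old == 1.0 is 1e-05 + 1e-03 * 1.0 = 0.00101.
# 1.0005 differs by about 0.0005 (within tolerance), 1.01 differs by 0.01 (outside).
# NaN is never close to NaN unless equal_nan=True is passed.
close = np.isclose(new, old, rtol=1e-03, atol=1e-05)
print(close)  # [ True False False]
```

This is also why cells missing from both frames are flagged by the mask but do not show up in the result: `stack()` drops them as NaN.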
    
  • 2020-11-22 14:57

    Method 1 of the accepted answer will not work for data frames containing NaNs, because np.nan != np.nan. I am not sure if this is the best way, but the problem can be avoided by comparing each row as a tuple of strings:

    df1[~df1.astype(str).apply(tuple, axis=1).isin(df2.astype(str).apply(tuple, axis=1))]
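
A small sketch of the one-liner above on made-up data, assuming the cast to `str` turns NaN into an ordinary string value so that rows containing NaN can be matched. The names `key1`, `key2`, and `diff` are illustrative.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself, which is what breaks value-based matching.
assert np.nan != np.nan

# Illustrative data: the second row of df1 contains NaN and also exists in df2.
df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['x', 'y', 'z']})
df2 = pd.DataFrame({'A': [np.nan], 'B': ['y']})

# Represent every row as a tuple of strings, then test row membership.
key1 = df1.astype(str).apply(tuple, axis=1)
key2 = df2.astype(str).apply(tuple, axis=1)
diff = df1[~key1.isin(key2)]
print(diff)
```

One design caveat: casting to strings also makes values such as the number 1 and the string '1' indistinguishable, so this is best suited to frames with consistent column types.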
    