Comparing two pandas dataframes for differences

感情败类 2020-11-30 02:53

I've got a script updating 5-10 columns' worth of data, but sometimes the start CSV will be identical to the end CSV, so instead of writing an identical CSV file I want it to

8 answers
  • 2020-11-30 03:20

    Not sure if this is helpful or not, but I whipped together this quick python method for returning just the differences between two dataframes that both have the same columns and shape.

    def get_different_rows(source_df, new_df):
        """Returns just the rows from the new dataframe that differ from the source dataframe"""
        merged_df = source_df.merge(new_df, indicator=True, how='outer')
        changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
        return changed_rows_df.drop('_merge', axis=1)
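
As a rough usage sketch for the function above (the `id` and `price` columns and their values are made up for illustration): the outer merge matches rows by value across all shared columns rather than by position, so a row counts as "changed" when it appears only in the new dataframe.

```python
import pandas as pd


def get_different_rows(source_df, new_df):
    """Returns just the rows from the new dataframe that differ from the source dataframe"""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)


# Hypothetical sample data: only the price for id 2 differs.
source_df = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
new_df = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 25.0, 30.0]})

changed = get_different_rows(source_df, new_df)
print(changed)

# Identical inputs give an empty result, which can serve as the "nothing changed" signal.
unchanged = get_different_rows(source_df, source_df.copy())
print(unchanged.empty)
```

Because the match is by value, reordered rows are not reported as differences, and duplicated rows may need separate handling.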
    
  • 2020-11-30 03:23

    Not sure if this existed at the time the question was posted, but pandas now has a built-in function to test equality between two dataframes: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html.
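
Applied to the original question, a minimal sketch might look like the following. The helper name `write_if_changed`, the sample data, and the output path are assumptions for illustration, not something from the question.

```python
import os
import tempfile

import pandas as pd


def write_if_changed(old_df, new_df, path):
    """Write new_df to path only when it differs from old_df; return True if a file was written."""
    if old_df.equals(new_df):
        return False
    new_df.to_csv(path, index=False)
    return True


# Hypothetical data standing in for the start and end CSV contents.
old_df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
same_df = old_df.copy()
changed_df = old_df.copy()
changed_df.loc[1, "b"] = "z"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    wrote_same = write_if_changed(old_df, same_df, path)        # nothing to write
    wrote_changed = write_if_changed(old_df, changed_df, path)  # file is written

print(wrote_same, wrote_changed)
```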

  • 2020-11-30 03:24

    A more accurate comparison should check for index names separately, because DataFrame.equals does not test for that. All the other properties (index values (single/multiindex), values, columns, dtypes) are checked by it correctly.

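    import pandas as pd

    df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
    df1 = df1.set_index('name')
    df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
    df2 = df2.set_index('another_name')

    # Same values, columns and index values, so equals() reports True
    df1.equals(df2)
    # True

    # The index names differ, which equals() does not check
    df1.index.names == df2.index.names
    # False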
    

    Note: using index.names instead of index.name makes it work for multi-indexed dataframes as well.
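
The two checks can be combined into one helper. This is only a sketch, and the name `frames_identical` is made up here:

```python
import pandas as pd


def frames_identical(left, right):
    """True when DataFrame.equals passes and the index names match as well."""
    same_content = left.equals(right)
    same_index_names = list(left.index.names) == list(right.index.names)
    return same_content and same_index_names


df1 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'name'])
df1 = df1.set_index('name')
df2 = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['num', 'another_name'])
df2 = df2.set_index('another_name')

print(frames_identical(df1, df2))         # index names differ
print(frames_identical(df1, df1.copy()))  # a copy keeps the index name
```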

  • 2020-11-30 03:27

    This compares the values of two dataframes. Note that the number of rows and columns must be the same in both tables:

    import numpy as np

    # Element-wise comparison of the underlying arrays
    comparison_array = table.values == expected_table.values
    print(comparison_array)
    # [[ True  True  True]
    #  [ True False  True]]

    if not comparison_array.all():
        print("Not the same")

    # Return the positions of the False values
    np.where(~comparison_array)
    # (array([1]), array([1]))
    

    You could then use this index information to return the value that does not match between the tables. Since the result is zero-indexed, (1, 1) refers to the 2nd position in the 2nd row, which is correct.
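
One possible way to turn those positions into a readable report is sketched below. The sample tables and the `mismatches` layout are assumptions for illustration. Note that `==` treats NaN as unequal to NaN, so missing values in the same cell would show up as mismatches here.

```python
import numpy as np
import pandas as pd

# Hypothetical tables that differ in a single cell (row 1, column "b").
table = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
expected_table = pd.DataFrame([[1, 2, 3], [4, 9, 6]], columns=["a", "b", "c"])

comparison_array = table.values == expected_table.values
rows, cols = np.where(~comparison_array)

# Map the integer positions back to labels and pull both values for each mismatch.
mismatches = pd.DataFrame({
    "row": list(table.index[rows]),
    "column": list(table.columns[cols]),
    "actual": table.values[rows, cols],
    "expected": expected_table.values[rows, cols],
})
print(mismatches)
```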

  • 2020-11-30 03:28

    You also need to be careful to create a copy of the DataFrame, otherwise the csvdata_old will be updated with csvdata (since it points to the same object):

    csvdata_old = csvdata.copy()
    

    To check whether they are equal, you can use assert_frame_equal:

    # pandas.util.testing was removed in pandas 2.0; pandas.testing is the supported location
    from pandas.testing import assert_frame_equal
    assert_frame_equal(csvdata, csvdata_old)
    

    You can wrap this in a function with something like:

    def frames_equal(csvdata, csvdata_old):  # function name is illustrative
        try:
            assert_frame_equal(csvdata, csvdata_old)
            return True
        except Exception:  # apparently AssertionError doesn't catch everything
            return False
    

    There was discussion of a better way...
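
If it helps to know why the frames differ, the assertion message can be kept instead of discarded. The following is only a sketch: `describe_difference` is a made-up helper, and it assumes the difference surfaces as an `AssertionError`, which, as noted above, may not cover every case.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def describe_difference(left, right):
    """Return None when the frames match, otherwise the assertion message as a string."""
    try:
        assert_frame_equal(left, right)
    except AssertionError as exc:
        return str(exc)
    return None


# Hypothetical data: take the copy first, then modify the original.
csvdata = pd.DataFrame({"a": [1, 2, 3]})
csvdata_old = csvdata.copy()
csvdata.loc[0, "a"] = 100

message = describe_difference(csvdata, csvdata_old)
print(message)
```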

  • 2020-11-30 03:29

    Check using df_1.equals(df_2), which returns True or False. Details below:

    In [45]: import numpy as np
    
    In [46]: import pandas as pd
    
    In [47]: np.random.seed(5)
    
    In [48]: df_1= pd.DataFrame(np.random.randn(3,3))
    
    In [49]: df_1
    Out[49]: 
              0         1         2
    0  0.441227 -0.330870  2.430771
    1 -0.252092  0.109610  1.582481
    2 -0.909232 -0.591637  0.187603
    
    In [50]: np.random.seed(5)
    
    In [51]: df_2= pd.DataFrame(np.random.randn(3,3))
    
    In [52]: df_2
    Out[52]: 
              0         1         2
    0  0.441227 -0.330870  2.430771
    1 -0.252092  0.109610  1.582481
    2 -0.909232 -0.591637  0.187603
    
    In [53]: df_1.equals(df_2)
    Out[53]: True
    
    
    In [54]: df_3= pd.DataFrame(np.random.randn(3,3))
    
    In [55]: df_3
    Out[55]: 
              0         1         2
    0 -0.329870 -1.192765 -0.204877
    1 -0.358829  0.603472 -1.664789
    2 -0.700179  1.151391  1.857331
    
    In [56]: df_1.equals(df_3)
    Out[56]: False
    