Comparing two pandas dataframes for differences

感情败类 2020-11-30 02:53

I've got a script updating 5-10 columns' worth of data, but sometimes the start CSV will be identical to the end CSV, so instead of writing an identical CSV file I want it to

8 Answers
  • 2020-11-30 03:35

    To pull out the symmetric differences:

    df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
    

    For example:

    df1 = pd.DataFrame({
        'num': [1, 4, 3],
        'name': ['a', 'b', 'c'],
    })
    df2 = pd.DataFrame({
        'num': [1, 2, 3],
        'name': ['a', 'b', 'd'],
    })
    

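    Will yield the four rows that are not shared: (4, 'b') and (3, 'c') from df1, and (2, 'b') and (3, 'd') from df2. The common row (1, 'a') is dropped from both.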

    Note: older pandas releases emit a FutureWarning about the future default of the sort argument when the columns of the two frames are not aligned. To silence it, pass sort=False explicitly, as below:

    df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)
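Tying this back to the question, here is a minimal sketch of skipping the write when nothing changed. The helper names `symmetric_diff` and `write_if_changed` and the output path are hypothetical, and the sketch assumes both frames have the same columns.

```python
import os
import tempfile

import pandas as pd


def symmetric_diff(df1, df2):
    """Rows that appear in exactly one of the two frames."""
    return pd.concat([df1, df2], sort=False).drop_duplicates(keep=False)


def write_if_changed(df_before, df_after, path):
    """Hypothetical helper: write df_after only if it differs from df_before."""
    if symmetric_diff(df_before, df_after).empty:
        return False  # identical content, skip the write
    df_after.to_csv(path, index=False)
    return True


df1 = pd.DataFrame({'num': [1, 4, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'd']})

df_diff = symmetric_diff(df1, df2)
print(df_diff)

# Temporary location used only for this illustration.
out_path = os.path.join(tempfile.mkdtemp(), 'out.csv')
unchanged_written = write_if_changed(df1, df1.copy(), out_path)
changed_written = write_if_changed(df1, df2, out_path)
```

One caveat: drop_duplicates(keep=False) also discards rows that are duplicated within a single frame, and it ignores row order, so this check is only reliable when each frame has unique rows and order does not matter.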
    
  • 2020-11-30 03:37

    In my case, I had a weird error: even though the indices, column names and values were the same, the DataFrames didn't match. I tracked it down to the data types; pandas can sometimes end up with different dtypes for the same values, which causes this kind of problem.

    For example:

    param2 = pd.DataFrame({'a': [1]})
    param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})

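    If you check param1.dtypes and param2.dtypes after processing, a column such as 'a' can turn out to be of type object in one frame and int64 in the other. (Built exactly as above, 'a' is int64 in both; the mismatch appears once the data passes through an operation that mixes it with the string column, for example a transpose or a row-wise operation.) Once the dtypes differ, any manipulation that combines param1 and param2 carries the difference into the result.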

    So after the final DataFrame is generated, even though the printed values are the same, final_df1.equals(final_df2) may return False, because equals also requires matching dtypes, and internal details such as Axis 1, ObjectBlock and IntBlock may not be the same.
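A small sketch of how this can happen. The double transpose is used here only as a stand-in for "some manipulation"; it is an assumption for illustration, not necessarily what occurred in the case described above.

```python
import pandas as pd

param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})

# Round-tripping through a transpose forces every column to a common dtype
# (object here, because 'step' holds strings).
final_df1 = param1.T.T[['a', 'b', 'c']]
final_df2 = param1[['a', 'b', 'c']]

print(final_df1.dtypes)  # object columns
print(final_df2.dtypes)  # integer columns

# The values agree cell by cell, yet equals() reports a mismatch
# because the dtypes differ.
same_values = bool((final_df1 == final_df2).all().all())
same_frames = final_df1.equals(final_df2)
print(same_values, same_frames)
```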

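    An easy way to get around this and compare the values is to use

    final_df1 == final_df2

    However, this does an element-by-element comparison and returns a DataFrame of booleans rather than a single True or False, so it won't work directly in an assert statement, for example in pytest (the truth value of a DataFrame is ambiguous and raises a ValueError).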

    TL;DR

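    What works well is

    (final_df1 == final_df2).all().all()

    This does an element-by-element comparison and then reduces the result to a single boolean, while ignoring details such as dtypes that are not important for the comparison. Note that the built-in all(final_df1 == final_df2) is not a substitute: it iterates over the column labels, not the values. Also, == requires both frames to have identical row and column labels, and NaN never compares equal to NaN.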
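One caveat worth checking: Python's built-in all() applied to a DataFrame iterates over its column labels, so it can report True even when values differ. The sketch below contrasts it with reducing via DataFrame.all(), and shows pandas.testing.assert_frame_equal with check_dtype=False as an option for pytest. The frames are made-up sample data, and the dtype mismatch here is int64 versus float64.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
right_same = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).astype(float)
right_diff = pd.DataFrame({'a': [1, 2], 'b': [3, 99]})

# Built-in all() iterates over the column labels 'a' and 'b' (both truthy),
# so it is True even though one cell differs.
builtin_all = all(left == right_diff)

# Reducing with DataFrame.all() twice collapses the boolean frame to one value.
values_match_diff = bool((left == right_diff).all().all())
values_match_same = bool((left == right_same).all().all())

# equals() is stricter: same values but different dtypes gives False.
strict_equal = left.equals(right_same)

# In a test, assert_frame_equal raises AssertionError with a readable report;
# check_dtype=False ignores the int64/float64 mismatch.
assert_frame_equal(left, right_same, check_dtype=False)
```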

    TL;DR2

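    If your values and indices are the same but final_df1.equals(final_df2) returns False, you can inspect final_df1._data and final_df2._data to check the rest of the internals of the DataFrames. Note that _data is a private attribute and may change between pandas versions; comparing final_df1.dtypes with final_df2.dtypes is often enough to find the culprit.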
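As a sketch of the same check using only public attributes, the snippet below lists the columns whose dtypes disagree and then casts one frame to the other's dtypes. The sample frames are invented for illustration, and the cast assumes the values survive the conversion unchanged.

```python
import pandas as pd

final_df1 = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})
final_df2 = final_df1.astype({'a': 'float64'})

equal_before = final_df1.equals(final_df2)

# Compare the two dtype Series label by label and keep the mismatches.
differs = final_df1.dtypes != final_df2.dtypes
mismatched_columns = differs[differs].index.tolist()
print(mismatched_columns)

# Cast the second frame to the first frame's dtypes, then compare again.
harmonized = final_df2.astype(final_df1.dtypes.to_dict())
equal_after_cast = final_df1.equals(harmonized)
print(equal_before, equal_after_cast)
```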
