Comparing two pandas dataframes for differences

I've got a script updating 5-10 columns' worth of data, but sometimes the start csv will be identical to the end csv, so instead of writing an identical csv file I want it to

8 answers

小蘑菇

2020-11-30 03:35
To pull out the symmetric differences:
```
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
```
For example:
```
import pandas as pd

df1 = pd.DataFrame({
    'num': [1, 4, 3],
    'name': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
    'num': [1, 2, 3],
    'name': ['a', 'b', 'd'],
})
```
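Will yield the rows that appear in only one of the two frames: (4, 'b') and (3, 'c') from df1, and (2, 'b') and (3, 'd') from df2, each keeping its original index label. The shared row (1, 'a') is dropped.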

Note: older pandas releases warn that the default of the sort argument will change in the future. To avoid that warning, pass sort=False explicitly, as below:
```
df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)
```
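As a rough end-to-end sketch of how this could serve the original goal (skipping the write when nothing changed), the snippet below reruns the example and checks whether the difference is empty. The `has_changes` helper and the commented-out output path are hypothetical names, not part of the answer. It also assumes that neither frame contains duplicate rows of its own, since `drop_duplicates(keep=False)` would drop those as well, and that row order does not matter.

```python
import pandas as pd

df1 = pd.DataFrame({'num': [1, 4, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'd']})


def has_changes(before, after):
    """Hypothetical helper: True if any row appears in only one frame."""
    diff = pd.concat([before, after], sort=False).drop_duplicates(keep=False)
    return not diff.empty


# Rows that occur exactly once across both frames, with their original index labels
df_diff = pd.concat([df1, df2], sort=False).drop_duplicates(keep=False)
print(df_diff)

if has_changes(df1, df2):
    # df2.to_csv("end.csv", index=False)  # hypothetical output path
    print("data changed, write the new csv")
else:
    print("identical, skip the write")
```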
野趣味

2020-11-30 03:37

In my case, I had a weird error whereby, even though the indices, column names and values were the same, the DataFrames didn't match. I tracked it down to the data types: pandas can sometimes end up with different dtypes for the same values, which results in such problems.

For example:

    param2 = pd.DataFrame({'a': [1]})
    param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})

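If you check param1.dtypes and param2.dtypes right after construction, 'a' is int64 in both. But a row of param1 also holds the string 'alpha', so any operation that passes through a whole row (param1.iloc[0], for instance) stores its values as object, and 'a' can come back as object for data derived from param1 while it stays int64 for param2. Once you manipulate a combination of param1 and param2, the internal layout of the resulting dataframe can therefore deviate from the default one.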
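A small sketch of how a dtype difference alone can make `equals` fail. Here `astype(object)` stands in for whatever manipulation changed the dtype in the answerer's pipeline (that step is an assumption; the answer does not show it), and `as_object` is an illustrative name.

```python
import pandas as pd

param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})
param2 = pd.DataFrame({'a': [1]})

# A single row of param1 mixes integers and a string, so it is stored as object.
row = param1.iloc[0]
print(row.dtype)

# Assumption: astype(object) mimics a manipulation that lost the integer dtype.
as_object = param2.astype(object)
print(as_object.dtypes)
print(param2.dtypes)

# The values still compare equal cell by cell...
same_values = bool((as_object == param2).all().all())
# ...but equals() also requires matching dtypes.
strictly_equal = as_object.equals(param2)
print(same_values, strictly_equal)
```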

So after the final dataframe is generated, even though the values that are printed out are the same, final_df1.equals(final_df2) may return False, because the small internal parameters shown in the block manager output, such as Axis 1, ObjectBlock and IntBlock, may not be the same.

An easy way to get around this and compare the values is to use

final_df1 == final_df2.

However, this does an element-by-element comparison and returns a DataFrame of booleans rather than a single value, so it won't work directly in an assert statement, for example in pytest.

TL;DR

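What works well is

(final_df1 == final_df2).all().all()

This does an element-by-element comparison and reduces the result to a single boolean, while neglecting the parameters not important for comparison. Avoid the built-in all(final_df1 == final_df2): iterating over a DataFrame yields its column labels, so that expression is True whenever the labels are truthy, regardless of the values.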
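One thing worth checking when reducing the comparison to a single boolean is which `all` is being applied. The minimal sketch below uses two made-up frames that differ in one cell and a hypothetical `frames_match` helper. It assumes both frames carry the same index and column labels, since `==` raises a ValueError otherwise.

```python
import pandas as pd

final_df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
final_df2 = pd.DataFrame({'a': [1, 3], 'b': ['x', 'y']})


def frames_match(left, right):
    """Hypothetical helper: True only if every cell compares equal."""
    return bool((left == right).all().all())


elementwise = final_df1 == final_df2  # DataFrame of booleans
builtin_all = all(elementwise)        # iterates over the column labels 'a' and 'b'
reduced = frames_match(final_df1, final_df2)

print(builtin_all, reduced)
```

Note that NaN cells compare unequal under `==`, so frames containing missing values would need them handled before this check.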

TL;DR2

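If your values and indices are the same but final_df1.equals(final_df2) returns False, you can inspect final_df1._data and final_df2._data to check the rest of the elements of the dataframes. Note that _data is a private attribute (exposed as _mgr in newer pandas versions), so it may change between releases.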
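Before reaching into internals, a gentler first step may be to compare the public `dtypes` of both frames. The sketch below uses a hypothetical `dtype_mismatches` helper and made-up frames, and only looks at columns the two frames share.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Hypothetical helper: column -> (left dtype, right dtype) where they differ."""
    shared = left.columns.intersection(right.columns)
    return {
        col: (left[col].dtype, right[col].dtype)
        for col in shared
        if left[col].dtype != right[col].dtype
    }


final_df1 = pd.DataFrame({'a': [1], 'b': [2.0]})
# Same values, but column 'a' is stored as object in the second frame
final_df2 = final_df1.astype({'a': object})

print(final_df1.equals(final_df2))
print(dtype_mismatches(final_df1, final_df2))
```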
