I've got a script updating 5-10 columns worth of data, but sometimes the start csv will be identical to the end csv so instead of writing an identical csvfile I want it to
To pull out the symmetric differences:
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
For example:
df1 = pd.DataFrame({
'num': [1, 4, 3],
'name': ['a', 'b', 'c'],
})
df2 = pd.DataFrame({
'num': [1, 2, 3],
'name': ['a', 'b', 'd'],
})
Will yield:
   num name
1    4    b
2    3    c
1    2    b
2    3    d
Note: on pandas versions that warn about the default of the sort argument changing in the future, just pass sort=False explicitly to avoid the warning, as below:
df_diff = pd.concat([df1,df2], sort=False).drop_duplicates(keep=False)
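Tying this back to the question, the check could be wrapped up roughly as follows. This is only a sketch: the helper name frames_differ is hypothetical, and it assumes neither frame contains internally duplicated rows, since drop_duplicates(keep=False) would drop those too.

```python
import pandas as pd


def frames_differ(df1, df2):
    """Return True if some row appears in only one of the two frames.

    Hypothetical helper; assumes no duplicated rows within either frame.
    """
    df_diff = pd.concat([df1, df2], sort=False).drop_duplicates(keep=False)
    return not df_diff.empty


df1 = pd.DataFrame({'num': [1, 4, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'num': [1, 2, 3], 'name': ['a', 'b', 'd']})

changed = frames_differ(df1, df2)            # some rows differ
unchanged = frames_differ(df1, df1.copy())   # identical frames
```

When the helper returns False, writing the output csv can simply be skipped.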
In my case, I had a weird error, whereby even though the indices, column names and values were the same, the DataFrames didn't match. I tracked it down to the data types, and it seems pandas can sometimes use different data types, resulting in such problems.
For example:
param2 = pd.DataFrame({'a': [1]})
param1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [2], 'step': ['alpha']})
If you check param1.dtypes and param2.dtypes, you will find that 'a' is of type int64 in both, but param1 also carries a column of type object ('step'). Now, if you do some manipulation using a combination of param1 and param2, a column can end up as object in one result and int64 in the other, and other internal parameters of the dataframe will deviate from the default ones.
So after the final dataframe is generated, even though the actual values that are printed out are the same, final_df1.equals(final_df2) may turn out to be False, because internal details like Axis 1, ObjectBlock, IntBlock may not be the same.
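The effect can be reproduced in a few lines. The frames below are made up for illustration: they hold the same index, column name and value, and only the dtype of the column differs.

```python
import pandas as pd

# Same index, column name and printed value; only the dtype differs.
final_df1 = pd.DataFrame({'a': [1]})                  # integer dtype
final_df2 = pd.DataFrame({'a': [1]}).astype(object)   # object dtype

# Element-wise comparison only looks at the values.
same_values = bool((final_df1 == final_df2).all().all())

# equals() additionally requires matching column dtypes.
same_per_equals = final_df1.equals(final_df2)
```

Here same_values is True while same_per_equals is False, which matches the symptom described above.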
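An easy way to get around this and compare the values is to use final_df1 == final_df2.
However, this does an element-by-element comparison and returns a DataFrame of booleans, so it won't work if you are using it to assert a statement, for example in pytest.
What works well is (final_df1 == final_df2).all().all(). (Note that the built-in all(final_df1 == final_df2) only iterates over the column labels, so it does not actually check the values.)
This reduces the element-by-element comparison to a single boolean, while neglecting the parameters not important for comparison.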
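For use in a test, the comparison could be sketched as below. The frames are made up for illustration, and pandas.testing.assert_frame_equal with check_dtype=False is an additional option I'm assuming fits this situation, since it compares values while ignoring dtype differences.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

final_df1 = pd.DataFrame({'a': [1, 2]})
final_df2 = pd.DataFrame({'a': [1, 2]}).astype(object)  # same values, other dtype
other_df = pd.DataFrame({'a': [1, 99]})                 # one value differs

# Reduce the element-wise result to a single bool with DataFrame.all().
values_match = bool((final_df1 == final_df2).all().all())
values_differ = not bool((final_df1 == other_df).all().all())

# Built-in all() walks the column labels ('a'), not the cells,
# so it is truthy here even though a value differs.
builtin_all_result = all(final_df1 == other_df)

# Raises AssertionError on a value mismatch; dtype differences are ignored.
assert_frame_equal(final_df1, final_df2, check_dtype=False)
```

Keep in mind that == needs both frames to have identical row and column labels, and that NaN never compares equal to NaN.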
If your values and indices are the same, but final_df1.equals(final_df2) is showing False, you can use final_df1._data and final_df2._data to check the rest of the elements of the dataframes.
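Since _data is an underscore-prefixed internal attribute, comparing the public .dtypes of the two frames is another way to narrow things down. A small sketch with made-up frames, assuming both frames share the same column labels:

```python
import pandas as pd

final_df1 = pd.DataFrame({'a': [1], 'b': [2.0]})
final_df2 = final_df1.astype({'a': object})  # same values, 'a' stored as object

# Collect the columns whose dtype differs between the two frames.
mismatched_columns = [
    col for col in final_df1.columns
    if final_df1[col].dtype != final_df2[col].dtype
]
```

In this example only column 'a' ends up in mismatched_columns.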