问题
Suppose I want to compare the content of two dataframes, but not the column names (or index names). Is it possible to achieve this without renaming the columns?
For example:
df = pd.DataFrame({'A': [1,2], 'B':[3,4]})
df_equal = pd.DataFrame({'a': [1,2], 'b':[3,4]})
df_diff = pd.DataFrame({'A': [1,2], 'B':[3,5]})
In this case, df
is df_equal
but different to df_diff
, because the values in df_equal
has the same content, but the ones in df_diff
. Notice that the column names in df_equal
are different, but I still want to get a true value.
I have tried the following:
equals:
# Returns false because of the column names
df.equals(df_equal)
eq:
# doesn't work as it compares four columns (A,B,a,b) assuming nulls for the one that doesn't exist
df.eq(df_equal).all().all()
pandas.testing.assert_frame_equal:
# same as equals
pd.testing.assert_frame_equal(df, df_equal, check_names=False)
I thought that it was going to be possible to use the assert_frame_equal
, but none of the parameters seem to work to ignore column names.
回答1:
pd.DataFrame
is built around pd.Series
, so it's unlikely you will be able to perform comparisons without column names.
But the most efficient way would be to drop down to numpy
:
assert_equal = (df.values == df_equal.values).all()
To deal with np.nan
, you can use np.testing.assert_equal
and catch AssertionError
, as suggested by @Avaris :
import numpy as np
def nan_equal(a,b):
try:
np.testing.assert_equal(a,b)
except AssertionError:
return False
return True
assert_equal = nan_equal(df.values, df_equal.values)
回答2:
I just needed to get the values (numpy array) from the data frame, so the column names won't be considered.
df.eq(df_equal.values).all().all()
I would still like to see a parameter on equals
, or assert_frame_equal
. Maybe I am missing something.
An advantage of this compared to @jpp answer is that, I can get see which columns do not match, calling only all()
only once:
df.eq(df_diff.values).all()
Out[24]:
A True
B False
dtype: bool
One problem is that when eq is used, then np.nan
is not equal to np.nan
, in which case the following expression, would serve well:
(df.eq(df_equal.values) | (df.isnull().values & df_equal.isnull().values)).all().all()
回答3:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
for i in range(df1.shape[0]):
for j in range(df1.shape[1]):
print(df1.iloc[i, j] == df2.iloc[i, j])
Will return:
True
True
True
True
Same thing for:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
One obvious issue is that column names matters in Pandas to sort dataframes. For example:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})
print(df1)
print(df2)
renders as ('B' is before 'a' in df2):
a b
0 1 3
1 2 4
B a
0 3 1
1 4 2
来源:https://stackoverflow.com/questions/49233359/how-to-compare-two-dataframes-ignoring-column-names