How to compare two dataframes ignoring column names?

孤街醉人 submitted on 2021-02-16 14:35:08

Question


Suppose I want to compare the content of two dataframes, but not the column names (or index names). Is it possible to achieve this without renaming the columns?

For example:

import pandas as pd

df = pd.DataFrame({'A': [1,2], 'B':[3,4]})
df_equal = pd.DataFrame({'a': [1,2], 'b':[3,4]})
df_diff = pd.DataFrame({'A': [1,2], 'B':[3,5]})

In this case, df should be equal to df_equal but different from df_diff, because df_equal holds the same values and df_diff does not. Notice that the column names in df_equal are different, but I still want to get True.

I have tried the following:

equals:

# Returns False because the column labels differ
df.equals(df_equal)

eq:

# Doesn't work: eq aligns on labels, so it compares four columns (A, B, a, b) and assumes nulls for the ones that don't exist
df.eq(df_equal).all().all()

pandas.testing.assert_frame_equal:

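# Same problem as equals: raises AssertionError because the column labels differ
# (check_names only covers the names attribute of the index and columns)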
pd.testing.assert_frame_equal(df, df_equal, check_names=False)

I thought it would be possible to use assert_frame_equal, but none of its parameters seems to ignore column names.


Answer 1:


pd.DataFrame is built around pd.Series, so it's unlikely you will be able to perform comparisons without column names.

But the most efficient way would be to drop down to numpy:

assert_equal = (df.values == df_equal.values).all()
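
One caveat with the raw == comparison: NumPy broadcasts arrays whose shapes are compatible but not identical, so a one-row frame can compare equal to a frame with two identical rows. A minimal sketch that avoids this with np.array_equal, which checks the shapes before comparing; the helper name values_equal is hypothetical and not part of pandas or NumPy:

```python
import numpy as np
import pandas as pd


def values_equal(left, right):
    """Positional comparison of two frames, ignoring index and column labels.

    Hypothetical helper: np.array_equal returns False when the shapes differ
    instead of broadcasting one array against the other.
    """
    return bool(np.array_equal(left.values, right.values))


two_rows = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
one_row = pd.DataFrame({'a': [1], 'b': [3]})

# Broadcasting makes the raw comparison succeed even though the shapes differ.
broadcast_result = bool((two_rows.values == one_row.values).all())
safe_result = values_equal(two_rows, one_row)
```

Note that, like ==, this still treats np.nan as unequal to np.nan.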

To deal with np.nan, you can use np.testing.assert_equal and catch AssertionError, as suggested by @Avaris:

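import numpy as np

def nan_equal(a, b):
    # np.testing.assert_equal accepts NaNs in matching positions
    # and raises AssertionError on any other mismatch
    try:
        np.testing.assert_equal(a, b)
    except AssertionError:
        return False
    return True

assert_equal = nan_equal(df.values, df_equal.values)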



Answer 2:


I just needed to get the values (numpy array) from the data frame, so the column names won't be considered.

df.eq(df_equal.values).all().all()

I would still like to see a parameter on equals, or assert_frame_equal. Maybe I am missing something.
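
Neither function appears to offer such a parameter, so one workaround is to relabel a copy of the second frame before comparing. This does rename, but only on a temporary copy, so neither input changes. A minimal sketch, assuming pandas 1.0 or later (where set_axis returns a new object by default); the wrapper name assert_frame_values_equal is hypothetical:

```python
import pandas as pd


def assert_frame_values_equal(left, right, **kwargs):
    """Assert that two frames hold the same values, position by position.

    Hypothetical wrapper: relabels a copy of `right` with the labels of
    `left`, then defers to pd.testing.assert_frame_equal. `right` itself
    is left untouched. set_axis raises ValueError if the lengths differ.
    """
    relabeled = right.set_axis(left.columns, axis=1).set_axis(left.index, axis=0)
    pd.testing.assert_frame_equal(left, relabeled, **kwargs)


df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df_equal = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_diff = pd.DataFrame({'A': [1, 2], 'B': [3, 5]})

assert_frame_values_equal(df, df_equal)  # passes silently

try:
    assert_frame_values_equal(df, df_diff)
    diff_detected = False
except AssertionError:
    diff_detected = True
```

Because the check goes through assert_frame_equal, dtypes are compared as well, and its keyword arguments (for example check_dtype=False) can be passed through.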


An advantage of the eq approach over @jpp's answer is that I can see which columns do not match by calling all() only once:

df.eq(df_diff.values).all()
Out[24]: 
A     True
B    False
dtype: bool

One problem is that eq does not treat np.nan as equal to np.nan. In that case, the following expression works well:

(df.eq(df_equal.values) | (df.isnull().values & df_equal.isnull().values)).all().all()
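
The NaN-aware expression can be wrapped in a small function that keeps the per-column result. A minimal sketch, assuming both frames have the same shape; the name columns_match is hypothetical:

```python
import numpy as np
import pandas as pd


def columns_match(left, right):
    """Return a boolean Series, indexed by the columns of `left`, that is
    True where a column equals the column at the same position in `right`.

    Hypothetical helper: NaN in the same position on both sides counts
    as a match.
    """
    if left.shape != right.shape:
        raise ValueError("frames must have the same shape")
    values_equal = left.eq(right.values)
    both_null = left.isnull().values & right.isnull().values
    return (values_equal | both_null).all()


left = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 4.0]})
right = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 5.0]})

per_column = columns_match(left, right)
all_match = bool(per_column.all())
```

Here per_column reports column A as matching (the NaNs line up) and column B as differing.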



Answer 3:


df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        print(df1.iloc[i, j] == df2.iloc[i, j])

This prints:

True
True
True
True

Same thing for:

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
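
The nested loop can also be written as a single positional mask, which makes it easy to list only the cells that differ. A minimal sketch, assuming both frames have the same shape; the name mismatch_positions is hypothetical:

```python
import numpy as np
import pandas as pd


def mismatch_positions(left, right):
    """Return (row, column) integer positions where the two frames differ.

    Hypothetical helper: labels are ignored entirely. NaN != NaN in NumPy,
    so NaN cells are reported as mismatches.
    """
    if left.shape != right.shape:
        raise ValueError("frames must have the same shape")
    mask = left.values != right.values
    return [(int(i), int(j)) for i, j in np.argwhere(mask)]


df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df3 = pd.DataFrame({'A': [1, 2], 'B': [3, 5]})

same = mismatch_positions(df1, df2)       # no differing cells
different = mismatch_positions(df1, df3)  # row 1, column 1 differs
```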

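One obvious issue is that column names can affect column order in pandas. Older versions (before pandas 0.23, or any version running on Python earlier than 3.6) sort the dict keys when building a dataframe, so a positional comparison may pair up the wrong columns; newer versions keep the insertion order. For example, on such an older version: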

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})
print(df1)
print(df2)

renders as ('B' is before 'a' in df2):

   a  b
0  1  3
1  2  4
   B  a
0  3  1
1  4  2


Source: https://stackoverflow.com/questions/49233359/how-to-compare-two-dataframes-ignoring-column-names
