pandas get rows which are NOT in other dataframe

春和景丽 2020-11-22 02:17

I have two pandas data frames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

13 Answers
  •  死守一世寂寞
    2020-11-22 03:00

    This is the best way to do it:

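    # Left-join on all of df2's columns; indicator=True adds a '_merge' column
    df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
                       how='left', indicator=True)
    # Keep the rows found only in df1 and leave out the helper '_merge' column
    df.loc[df._merge=='left_only',df.columns!='_merge']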
    

    Note that drop_duplicates() is used to minimize the number of comparisons. The merge works without it as well, although duplicate rows of df1 are then kept in the result. This approach compares the row contents themselves rather than the index or only one or two columns. The same code can be used with other filters such as 'both'. Rows marked 'right_only' appear only if the merge uses how='outer' or how='right', since a left join never produces them. The dataframes can have any number of columns and even different indices. The only requirement is that every column of df2 also occurs in df1.
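
    As a minimal sketch of the idea above, the following example runs the merge on made-up data. The column names col1 and col2 and all values are assumptions for illustration only. The row (3, 10) shares its col1 value with a row of df2 but differs in col2, so it still counts as being only in df1.

```python
import pandas as pd

# Hypothetical sample data: every row of df2 also appears in df1.
df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 3],
                    "col2": [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({"col1": [1, 2, 3],
                    "col2": [10, 11, 12]})

# Left-join on every column of df2; indicator=True adds a '_merge' column
# whose value is 'left_only' or 'both' for each row of the result.
merged = df1.merge(df2.drop_duplicates(), on=df2.columns.to_list(),
                   how="left", indicator=True)

# Rows that exist only in df1 (whole-row comparison, not a single column).
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# The same pattern with a different filter: rows present in both frames.
in_both = merged.loc[merged["_merge"] == "both"].drop(columns="_merge")

print(only_in_df1)
print(in_both)
```

    Here only_in_df1 holds the rows (4, 13), (5, 14) and (3, 10), while in_both holds the three rows shared with df2.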

    Why is this the best way?

    1. index.difference only works for comparisons based on a unique index.
    2. pandas.concat() coupled with drop_duplicates() is not ideal because it also removes rows that exist only in the dataframe you want to keep and are duplicated for valid reasons.
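
    The second point can be illustrated with a small sketch. The data and the column names a and b are made up, and keep=False is assumed to be the variant of the concat approach being criticized. To show that the merge can preserve legitimate duplicates, this sketch leaves out the drop_duplicates() call on df1.

```python
import pandas as pd

# Hypothetical data: the row (2, "y") is legitimately duplicated in df1
# and does not occur in df2.
df1 = pd.DataFrame({"a": [1, 2, 2, 3], "b": ["x", "y", "y", "z"]})
df2 = pd.DataFrame({"a": [1], "b": ["x"]})

# concat + drop_duplicates(keep=False) removes every row that occurs more
# than once, so both copies of (2, "y") are lost along with (1, "x").
via_concat = pd.concat([df1, df2]).drop_duplicates(keep=False)

# The merge-based filter only removes rows that match a row of df2.
merged = df1.merge(df2.drop_duplicates(), on=df2.columns.to_list(),
                   how="left", indicator=True)
via_merge = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(via_concat)
print(via_merge)
```

    With this data via_concat keeps only (3, "z"), whereas via_merge keeps both copies of (2, "y") as well as (3, "z").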
