pandas get rows which are NOT in other dataframe

春和景丽 2020-11-22 02:17

I have two pandas DataFrames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

13 answers
  •  北恋 (original poster)
    2020-11-22 03:02

    The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

    First, we need to modify the original DataFrame to add the row with data [3, 10].

    import pandas as pd

    df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5, 3],
                             'col2': [10, 11, 12, 13, 14, 10]})
    df2 = pd.DataFrame(data={'col1': [1, 2, 3],
                             'col2': [10, 11, 12]})
    
    df1
    
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    3     4    13
    4     5    14
    5     3    10
    
    df2
    
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    

    Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

    df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                       how='left', indicator=True)
    df_all
    
       col1  col2     _merge
    0     1    10       both
    1     2    11       both
    2     3    12       both
    3     4    13  left_only
    4     5    14  left_only
    5     3    10  left_only
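
    To see why the drop_duplicates() call matters, here is a small sketch. The frame df2_dup is a made-up variant of df2 (not part of the original data) in which the row (1, 10) appears twice. Without de-duplication the left join repeats the matching df1 row, so the result no longer lines up with df1 row for row.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})

# Hypothetical variant of df2 (an assumption for illustration only):
# the row (1, 10) is present twice.
df2_dup = pd.DataFrame({'col1': [1, 1, 2, 3],
                        'col2': [10, 10, 11, 12]})

# Without drop_duplicates, the df1 row (1, 10) matches two rows of df2_dup,
# so it shows up twice in the merged result.
naive = df1.merge(df2_dup, on=['col1', 'col2'], how='left', indicator=True)

# With drop_duplicates, every df1 row matches at most one row of df2_dup.
deduped = df1.merge(df2_dup.drop_duplicates(), on=['col1', 'col2'],
                    how='left', indicator=True)

print(len(df1), len(naive), len(deduped))  # 6 7 6
```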
    

    Create a boolean condition:

    df_all['_merge'] == 'left_only'
    
    0    False
    1    False
    2    False
    3     True
    4     True
    5     True
    Name: _merge, dtype: bool
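
    The boolean condition can then be used to pick the rows out of df1. The sketch below repeats the steps above end to end. The names mask and not_in_df2 are mine, and converting the condition to a NumPy array is a choice I made so that the selection is positional and does not depend on df1 having a default index.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Left join against the de-duplicated df2; the result has one row per df1 row,
# in the same order as df1.
df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                   how='left', indicator=True)

# Rows marked 'left_only' exist in df1 but not in df2.
mask = (df_all['_merge'] == 'left_only').to_numpy()
not_in_df2 = df1[mask]

print(not_in_df2)
#    col1  col2
# 3     4    13
# 4     5    14
# 5     3    10
```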
    

    Why other solutions are wrong

    A few solutions make the same mistake: they check only that each value appears somewhere in the corresponding column of df2, not that the values occur together in the same row. The added last row, which is unique as a row but whose individual values both appear in df2, exposes the mistake:

    common = df1.merge(df2, on=['col1', 'col2'])
    (~df1.col1.isin(common.col1)) & (~df1.col2.isin(common.col2))
    
    0    False
    1    False
    2    False
    3     True
    4     True
    5    False
    dtype: bool
    

    This solution gets the same wrong result:

    df1.isin(df2.to_dict('list')).all(axis=1)
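
    For contrast, here is a sketch of a check that does compare whole rows. It is a different technique from the merge above: both frames are turned into a MultiIndex and compared with isin, so each (col1, col2) pair is tested as a unit. The names cols, columnwise, in_df2 and not_in_df2 are mine.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})
cols = ['col1', 'col2']

# Column-wise check (the flawed approach): each column is tested on its own,
# so the row (3, 10) is wrongly treated as present in df2.
columnwise = (~df1.col1.isin(df2.col1)) & (~df1.col2.isin(df2.col2))

# Row-wise check: each (col1, col2) pair is compared as a single tuple.
in_df2 = pd.MultiIndex.from_frame(df1[cols]).isin(
    pd.MultiIndex.from_frame(df2[cols]))
not_in_df2 = df1[~in_df2]

print(columnwise.tolist())  # [False, False, False, True, True, False]
print((~in_df2).tolist())   # [False, False, False, True, True, True]
```

    This gives the same rows as the left-join approach on this data, and it keeps the original index of df1.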
    
