pandas get rows which are NOT in other dataframe

春和景丽 2020-11-22 02:17

I've two pandas data frames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?
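
For concreteness, the answers below all work with a pair of example frames along the following lines (a minimal reconstruction of the data used in the In [74]-[78] snippets further down):

    import pandas as pd

    # df1 holds all the rows; df2 holds the subset whose rows should be excluded
    df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                        'col2': [10, 11, 12, 13, 14]})
    df2 = pd.DataFrame({'col1': [1, 2, 3],
                        'col2': [10, 11, 12]})
    # desired result: the df1 rows (4, 13) and (5, 14), which do not appear in df2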

13 answers
  • 2020-11-22 02:53

    Assuming that the indexes are consistent between the dataframes (not taking the actual column values into account):

    df1[~df1.index.isin(df2.index)]
    
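    If the comparison really is purely index-based, a closely related sketch (subject to the caveat, raised in a later answer, that Index.difference assumes unique indexes) is:

    # df1, df2 as in the question's example data;
    # keeps the df1 rows whose index labels do not appear in df2's index
    df1.loc[df1.index.difference(df2.index)]
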
  • 2020-11-22 02:56

    As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask for filtering out the rows present in df2 is to convert the rows to a (Multi)Index:

    In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
    In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
    In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
    Out[79]:
       col1  col2
    1     2    11
    4     5    14
    5     3    10
    

    If the index should be taken into account, set_index has the keyword argument append to append columns to the existing index (see the sketch below). If the columns do not line up, list(df.columns) can be replaced with explicit column specifications to align the data.
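
    A minimal sketch of that append variant, reusing df1 and df2 from In [77]/[78] above (so a row of df1 is only dropped when both its index label and its column values match a row of df2):

    idx1 = df1.set_index(list(df1.columns), append=True).index
    idx2 = df2.set_index(list(df2.columns), append=True).index
    # row 2 of df1 (values (3, 12)) now survives: its values match a df2 row,
    # but that df2 row sits at a different index label
    df1.loc[~idx1.isin(idx2)]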

    pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
    

    could alternatively be used to create the indices, though I doubt this is more efficient.

  • 2020-11-22 03:00

    This is the best way to do it:

    df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
                                     how='left', indicator=True)
    df.loc[df._merge == 'left_only', df.columns != '_merge']
    

    Note that drop_duplicates is used to minimize the number of comparisons; the approach works without it as well. The best way is to compare the row contents themselves rather than the index or one or two columns, and the same pattern can be used with the other indicator values, 'both' and 'right_only', to get similar results (see the sketch after the list below). With this syntax the dataframes can have any number of columns and even different indices; the only requirement is that the same columns occur in both dataframes.

    Why is this the best way?

    1. index.difference only works for unique, index-based comparisons
    2. pandas.concat() coupled with drop_duplicates() is not ideal, because it will also get rid of rows that exist only in the dataframe you want to keep but happen to be duplicated for valid reasons.
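
    A sketch of the 'both' / 'right_only' variants mentioned above (note that with how='left' the indicator can never be 'right_only', so an outer merge is assumed here):

    merged = df1.drop_duplicates().merge(
        df2.drop_duplicates(), on=list(df1.columns), how='outer', indicator=True
    )
    rows_in_both = merged.loc[merged['_merge'] == 'both', merged.columns != '_merge']
    rows_only_in_df2 = merged.loc[merged['_merge'] == 'right_only', merged.columns != '_merge']
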
  • 2020-11-22 03:01

    You can do it using the isin(dict) method:

    In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
    Out[74]:
       col1  col2
    3     4    13
    4     5    14
    

    Explanation:

    In [75]: df2.to_dict('l')
    Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
    
    In [76]: df1.isin(df2.to_dict('l'))
    Out[76]:
        col1   col2
    0   True   True
    1   True   True
    2   True   True
    3  False  False
    4  False  False
    
    In [77]: df1.isin(df2.to_dict('l')).all(1)
    Out[77]:
    0     True
    1     True
    2     True
    3    False
    4    False
    dtype: bool
    
  • 2020-11-22 03:02

    One method would be to store the result of an inner merge of both dfs, then simply select the rows whose column values are not in this common set:

    In [119]:
    
    common = df1.merge(df2,on=['col1','col2'])
    print(common)
    df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    Out[119]:
       col1  col2
    3     4    13
    4     5    14
    

    EDIT

    Another method, as you've found, is to use isin, which will produce NaN rows that you can drop:

    In [138]:
    
    df1[~df1.isin(df2)].dropna()
    Out[138]:
       col1  col2
    3     4    13
    4     5    14
    

    However, if df2's rows are not aligned with df1's on the same index labels, then this won't work:

    df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
    

    will produce the entire df:

    In [140]:
    
    df1[~df1.isin(df2)].dropna()
    Out[140]:
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    3     4    13
    4     5    14
    
  • 2020-11-22 03:02

    The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left join from df1 to df2, making sure to first get just the unique rows of df2.

    First, we need to modify the original DataFrame to add the row with data [3, 10].

    df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                               'col2' : [10, 11, 12, 13, 14, 10]}) 
    df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                               'col2' : [10, 11, 12]})
    
    df1
    
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    3     4    13
    4     5    14
    5     3    10
    
    df2
    
       col1  col2
    0     1    10
    1     2    11
    2     3    12
    

    Perform a left join, eliminating duplicates in df2 so that each row of df1 joins with exactly one row of df2. Use the parameter indicator to return an extra column indicating which table the row came from.

    df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                       how='left', indicator=True)
    df_all
    
       col1  col2     _merge
    0     1    10       both
    1     2    11       both
    2     3    12       both
    3     4    13  left_only
    4     5    14  left_only
    5     3    10  left_only
    

    Create a boolean condition:

    df_all['_merge'] == 'left_only'
    
    0    False
    1    False
    2    False
    3     True
    4     True
    5     True
    Name: _merge, dtype: bool
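
    Applying that condition (and dropping the helper column) should leave just the rows of df1 that are not in df2:

    df1_only = df_all.loc[df_all['_merge'] == 'left_only', df_all.columns != '_merge']
    df1_only

       col1  col2
    3     4    13
    4     5    14
    5     3    10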
    

    Why other solutions are wrong

    A few solutions make the same mistake: they only check that each value appears independently in its column, not that the values appear together in the same row. Adding the last row, [3, 10], which is unique to df1 but whose individual values both occur in the respective columns of df2, exposes the mistake:

    common = df1.merge(df2,on=['col1','col2'])
    (~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
    0    False
    1    False
    2    False
    3     True
    4     True
    5    False
    dtype: bool
    

    This solution gets the same wrong result:

    df1.isin(df2.to_dict('l')).all(1)
    
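    With the six-row df1 above, this check marks the extra [3, 10] row as already present too, because 3 and 10 each occur somewhere in df2's columns even though they never occur together in one row; a quick sketch (using the long orient name 'list'):

    df1.isin(df2.to_dict('list')).all(1)
    # rows 0-2 come back True (genuinely shared with df2), rows 3-4 False,
    # and row 5 comes back True as well, so (3, 10) would wrongly be treated
    # as already present in df2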