pandas get rows which are NOT in other dataframe

后端未结

关注

 13  872

春和景丽

I\'ve two pandas data frames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which

相关标签:

13条回答

情书的邮戳

2020-11-22 02:53

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]

0 讨论(0)

发布评论:

提交评论

加载中...

青春惊慌失措

2020-11-22 02:56

As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]}) In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]}) In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)] Out[79]: col1 col2 1 2 11 4 5 14 5 3 10

If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.

0 讨论(0)

发布评论:

提交评论

加载中...

死守一世寂寞

2020-11-22 03:00

This is the best way to do it:

df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(), how='left', indicator=True) df.loc[df._merge=='left_only',df.columns!='_merge']

Note that drop duplicated is used to minimize the comparisons. It would work without them as well. The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. For this syntax dataframes can have any number of columns and even different indices. Only the columns should occur in both the dataframes.

Why this is the best way?

index.difference only works for unique index based comparisons

pandas.concat() coupled with drop_duplicated() is not ideal because it will also get rid of the rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.

0 讨论(0)

发布评论:

提交评论

加载中...

灰色年华

2020-11-22 03:01

you can do it using isin(dict) method:

In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)] Out[74]: col1 col2 3 4 13 4 5 14

Explanation:

In [75]: df2.to_dict('l') Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]} In [76]: df1.isin(df2.to_dict('l')) Out[76]: col1 col2 0 True True 1 True True 2 True True 3 False False 4 False False In [77]: df1.isin(df2.to_dict('l')).all(1) Out[77]: 0 True 1 True 2 True 3 False 4 False dtype: bool

0 讨论(0)

发布评论:

提交评论

加载中...

萌比男神i

2020-11-22 03:02

One method would be to store the result of an inner merge form both dfs, then we can simply select the rows when one column's values are not in this common:

In [119]: common = df1.merge(df2,on=['col1','col2']) print(common) df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))] col1 col2 0 1 10 1 2 11 2 3 12 Out[119]: col1 col2 3 4 13 4 5 14

EDIT

Another method as you've found is to use isin which will produce NaN rows which you can drop:

In [138]: df1[~df1.isin(df2)].dropna() Out[138]: col1 col2 3 4 13 4 5 14

However if df2 does not start rows in the same manner then this won't work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]: df1[~df1.isin(df2)].dropna() Out[140]: col1 col2 0 1 10 1 2 11 2 3 12 3 4 13 4 5 14

0 讨论(0)

发布评论:

提交评论

加载中...

北恋

2020-11-22 03:02

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]}) df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]}) df1 col1 col2 0 1 10 1 2 11 2 3 12 3 4 13 4 5 14 5 3 10 df2 col1 col2 0 1 10 1 2 11 2 3 12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], how='left', indicator=True) df_all col1 col2 _merge 0 1 10 both 1 2 11 both 2 3 12 both 3 4 13 left_only 4 5 14 left_only 5 3 10 left_only

Create a boolean condition:

df_all['_merge'] == 'left_only' 0 False 1 False 2 False 3 True 4 True 5 True Name: _merge, dtype: bool

Why other solutions are wrong

A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns from df2 exposes the mistake:

common = df1.merge(df2,on=['col1','col2']) (~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2)) 0 False 1 False 2 False 3 True 4 True 5 False dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict('l')).all(1)

0 讨论(0)

发布评论:

提交评论

加载中...

上一页 1 2 3 下一页

验证码

看不清?

提交回复