Question
I have two dataframes:
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
and
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])
Now I want to intersect the two and keep only the rows of df_large whose values in the shared columns a, b, c do not match any row of df_small, so the result should be:
df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
                                   [17, 1, 4, 3, 25],
                                   [93, 3, 2, 8, 18]]),
                         columns=['k', 'a', 'b', 'c', 'd'])
Answer 1:
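Use DataFrame.merge with indicator=True and a left join. The boolean mask has to line up row for row with df_large, so first remove duplicate rows from df_small with DataFrame.drop_duplicates (a duplicated row in df_small would produce extra rows in the merge result):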
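# Left join on the shared columns; rows also found in df_small get _merge == 'both'
m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
# Keep only the rows of df_large that had no match
df = df_large[m]
print(df)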
    k  a  b  c   d
3  16  2  1  2  13
4  17  1  4  3  25
5  93  3  2  8  18
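To see what the mask is built from, it can help to look at the `_merge` column that indicator=True adds. The following is a minimal sketch using the frames from the question; the name `merged` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# With no `on=` argument, merge joins on the columns both frames share (a, b, c).
# `merged` is an illustrative name, not part of the original answer.
merged = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)

# One label per row of df_large: 'both' if the row was found in df_small,
# 'left_only' otherwise.
print(merged['_merge'].tolist())
# ['both', 'both', 'both', 'left_only', 'left_only', 'left_only']
```

Because the left join returns one row per row of df_large here, the mask `merged['_merge'].ne('both')` can be applied to df_large directly.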
Another solution is very similar: filter with DataFrame.query and then drop the _merge column:
df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
              .query('_merge != "both"')
              .drop('_merge', axis=1))
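Both variants above call drop_duplicates on df_small before merging. The sketch below illustrates why, under the assumption that df_small contains a repeated row; `df_small_dup`, `with_dups`, and `without_dups` are hypothetical names made up for this example.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# Hypothetical input: df_small with its first row (1, 2, 3) repeated.
df_small_dup = pd.concat([df_small, df_small.iloc[[0]]], ignore_index=True)

# Without de-duplication, the matching row of df_large is emitted twice.
with_dups = df_large.merge(df_small_dup, how='left', indicator=True)

# With de-duplication, the merge result has one row per row of df_large.
without_dups = df_large.merge(df_small_dup.drop_duplicates(),
                              how='left', indicator=True)

print(len(df_large), len(with_dups), len(without_dups))  # 6 7 6
```

A mask computed from `with_dups` would no longer correspond position by position to df_large, which is the situation drop_duplicates avoids.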
Answer 2:
Use DataFrame.merge with an outer join and indicator=True, then keep only the rows marked left_only:
df = (df_large.merge(df_small, how='outer', indicator=True)
              .query('_merge == "left_only"')
              .drop('_merge', axis=1))
Output:
    k  a  b  c   d
3  16  2  1  2  13
4  17  1  4  3  25
5  93  3  2  8  18
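The outer join differs from a left join only when df_small has rows that do not occur in df_large. The following sketch assumes such a case; `df_small_extra` is a hypothetical frame invented for the example, not data from the question.

```python
import numpy as np
import pandas as pd

df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# Hypothetical df_small: (1, 2, 3) occurs in df_large, (0, 0, 0) does not.
df_small_extra = pd.DataFrame({'a': [1, 0], 'b': [2, 0], 'c': [3, 0]})

outer = df_large.merge(df_small_extra, how='outer', indicator=True)

# The unmatched df_small row shows up as 'right_only', with NaN in k and d.
counts = outer['_merge'].value_counts().to_dict()
print(counts)

# Filtering on 'left_only' discards both the matched row and the right_only row.
result = outer.query('_merge == "left_only"').drop('_merge', axis=1)
print(sorted(result['k'].tolist()))
```

In a case like this the NaN placeholders in the right_only row may also turn k and d into float columns, so a left join can be the simpler choice when only rows of df_large are wanted.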
Answer 3:
You can avoid merging and make your code a bit more readable. It is not very clear what happens when you merge and drop duplicates, whereas Index and MultiIndex objects were made for intersections and other set operations.
common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small)
# Parentheses replace the backslash continuations, which cannot be followed by comments
df_result = (
    df_large.set_index(common_columns)
    .drop(index=df_small_as_Multiindex)  # Drop the common rows
    .reset_index()  # Not needed if the a, b, c columns are meaningful indexes
)
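A related variant, not part of the original answer, builds a MultiIndex from both frames and uses MultiIndex.isin to produce a boolean mask. This is only a sketch under the assumption that the shared columns identify the rows to remove; `large_keys`, `small_keys`, and `mask` are illustrative names.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

common_columns = df_large.columns.intersection(df_small.columns).to_list()

# One (a, b, c) tuple per row of each frame.
large_keys = pd.MultiIndex.from_frame(df_large[common_columns])
small_keys = pd.MultiIndex.from_frame(df_small[common_columns])

# True where a row of df_large also appears in df_small.
mask = large_keys.isin(small_keys)

# Keep the rows that are not in df_small; index and column order are untouched.
df_result = df_large[~mask]
print(df_result)
```

Compared with set_index followed by drop and reset_index, this keeps the original row labels (3, 4, 5) and the original column order of df_large, and it does not require every row of df_small to be present in df_large.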
Source: https://stackoverflow.com/questions/58302280/intersect-a-dataframe-with-a-larger-one-that-includes-it-and-remove-common-rows