Intersect a dataframe with a larger one that includes it and remove common rows

折月煮酒 submitted on 2020-04-18 04:03:09

Question


I have two dataframes:

import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                     columns=['a', 'b', 'c'])

and

df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99], 
                                  [31, 4, 5, 6, 75], 
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                     columns=['k', 'a', 'b', 'c', 'd'])

Now I want to intersect the two and keep only the rows of df_large that do not match any row of df_small on the shared columns, so the result should be:

df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
                                   [17, 1, 4, 3, 25],
                                   [93, 3, 2, 8, 18]]),
                     columns=['k', 'a', 'b', 'c', 'd'])

Answer 1:


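Use DataFrame.merge with indicator=True and a left join. Duplicate rows in df_small would add extra rows to the merged result, so the boolean mask would no longer line up with df_large; it is therefore necessary to remove them from df_small first with DataFrame.drop_duplicates: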

m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
df = df_large[m]
print(df)
    k  a  b  c   d
3  16  2  1  2  13
4  17  1  4  3  25
5  93  3  2  8  18

Another, very similar solution filters with query and then drops the _merge column:

df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
              .query('_merge != "both"')
              .drop('_merge', axis=1))
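
To illustrate why the drop_duplicates step matters, here is a small sketch. The frame df_small_dup is a hypothetical variant of df_small with its first row repeated; it is not part of the original question. Without deduplication the left merge returns more rows than df_large has, so a mask built from it would not line up row for row.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# Hypothetical: df_small with its first row repeated (not in the question).
df_small_dup = pd.concat([df_small, df_small.iloc[[0]]], ignore_index=True)

# Without deduplication, the row matching (1, 2, 3) appears twice.
merged_raw = df_large.merge(df_small_dup, how='left', indicator=True)

# With deduplication, the merge has exactly one row per row of df_large.
merged_dedup = df_large.merge(df_small_dup.drop_duplicates(),
                              how='left', indicator=True)

print(len(df_large), len(merged_raw), len(merged_dedup))

# .to_numpy() applies the mask by position, so it does not rely on
# df_large having a default RangeIndex.
mask = merged_dedup['_merge'].ne('both').to_numpy()
result = df_large[mask]
print(result)
```

Converting the mask to a NumPy array is a choice made for this sketch, not part of the answer above; the answer's version relies on df_large and the merged frame sharing the same default index.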



Answer 2:


Use DataFrame.merge with an outer join and indicator=True, then keep only the rows marked left_only:

df_large.merge(df_small,how='outer',indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)

Output:

    k  a  b  c   d
3  16  2  1  2  13
4  17  1  4  3  25
5  93  3  2  8  18
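
If this pattern is needed in several places, it could be wrapped in a small helper. The sketch below is one possible shape, not an existing pandas function: the name anti_join and its on parameter are hypothetical. It uses a left join rather than the outer join above, names the key columns explicitly, and deduplicates the keys before merging.

```python
import numpy as np
import pandas as pd


def anti_join(left, right, on=None):
    """Return the rows of `left` whose key columns have no match in `right`.

    Hypothetical helper for illustration; `on` defaults to the columns
    that both frames share.
    """
    if on is None:
        on = left.columns.intersection(right.columns).tolist()
    # Keep only the key columns and drop repeated keys so the left join
    # yields exactly one row per row of `left`.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, on=on, how='left', indicator=True)
    # Select unmatched rows and restore the original column order.
    return merged.loc[merged['_merge'] == 'left_only', left.columns.tolist()]


df_small = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

result = anti_join(df_large, df_small)
print(result)
```

Note that the returned frame carries the index of the merged result, not the original index of the left frame.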



Answer 3:


You can avoid merging and make your code a bit more readable. It is not very clear what happens when you merge and drop duplicates, whereas Index and MultiIndex objects were made for intersections and other set operations.

common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small[common_columns])
df_result = (
    df_large.set_index(common_columns)
            .drop(index=df_small_as_Multiindex)  # Drop the common rows
            .reset_index()  # Not needed if the a, b, c columns are meaningful indexes
)
# Note: after reset_index() the columns a, b, c come before k and d.
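
The same idea, treating the shared columns as a MultiIndex, can also be expressed as a membership test. The sketch below is an alternative, not the code from this answer: it builds a MultiIndex from each frame's shared columns and uses MultiIndex.isin to get a boolean array, which keeps the original column order and row labels of df_large.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

common_columns = df_large.columns.intersection(df_small.columns).to_list()

# One MultiIndex entry per row, built from the shared columns only.
large_keys = pd.MultiIndex.from_frame(df_large[common_columns])
small_keys = pd.MultiIndex.from_frame(df_small[common_columns])

# True where a row of df_large also appears in df_small.
is_common = large_keys.isin(small_keys)

# Boolean array indexing is positional, so df_large is filtered in place
# without changing its columns or index.
df_result = df_large[~is_common]
print(df_result)
```

Because nothing is moved into the index, no reset_index() is needed and duplicate rows in df_small have no effect on the result.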


Source: https://stackoverflow.com/questions/58302280/intersect-a-dataframe-with-a-larger-one-that-includes-it-and-remove-common-rows
