Remove reverse duplicates from dataframe

伪装坚强ぢ 2020-11-30 12:50

I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would

3 answers
  • 2020-11-30 13:16

    You can sort each row of the data frame before dropping the duplicates:

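    # result_type='broadcast' keeps the original columns; without it,
    # recent pandas versions return a Series of lists here
    data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()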
    
    #   A    B
    #0  0   50
    #1  10  22
    #2  11  35
    #3  5   21
    

    If you prefer the result to be sorted by column A:

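    # result_type='broadcast' keeps the A and B columns so sort_values('A') works
    data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')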
    
    #   A    B
    #0  0   50
    #3  5   21
    #1  10  22
    #2  11  35
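
The two snippets above can be tried end to end with a small frame. This is only a sketch: the question's data is not shown in full, so the eight sample rows below are an assumption chosen to be consistent with the outputs quoted in the answers, and the names `sorted_rows`, `deduped` and `deduped_by_a` are illustrative.

```python
import pandas as pd

# Assumed sample data: the question's frame is not shown in full, so these
# eight rows are a guess consistent with the outputs quoted in the answers.
data = pd.DataFrame({
    "A": [0, 10, 11, 5, 50, 22, 35, 21],
    "B": [50, 22, 35, 21, 0, 10, 11, 5],
})

# Sort the values within each row. result_type="broadcast" keeps the
# original shape and column names instead of returning a Series of lists.
sorted_rows = data.apply(lambda r: sorted(r), axis=1, result_type="broadcast")

# Reverse pairs such as (0, 50) and (50, 0) are now identical rows,
# so drop_duplicates() keeps only the first of each.
deduped = sorted_rows.drop_duplicates()

# Optionally order the result by column A.
deduped_by_a = deduped.sort_values("A")

print(deduped)
print(deduped_by_a)
```

Note that this returns the sorted values, so a surviving row may show its pair in a different order than it had in the input.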
    
  • 2020-11-30 13:21

    Here is a slightly uglier but faster solution:

    In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
    Out[44]:
        A   B
    0   0  50
    1  10  22
    2  11  35
    3   5  21
    

    Timing for an 8,000-row DataFrame:

    In [50]: big = pd.concat([data] * 10**3, ignore_index=True)
    
    In [51]: big.shape
    Out[51]: (8000, 2)
    
    In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
    1 loop, best of 3: 3.04 s per loop
    
    In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
    100 loops, best of 3: 3.96 ms per loop
    
    In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
    1 loop, best of 3: 2.69 s per loop
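
For reference, the NumPy-based approach can be written as a standalone script. This is a minimal sketch under assumptions: the sample rows are again a guess consistent with the outputs above, and the second part (keeping the surviving rows in their original column order via a boolean mask) is a variant that does not appear in the answer itself.

```python
import numpy as np
import pandas as pd

# Assumed sample data, consistent with the outputs quoted in the answers.
data = pd.DataFrame({
    "A": [0, 10, 11, 5, 50, 22, 35, 21],
    "B": [50, 22, 35, 21, 0, 10, 11, 5],
})

# Sort the values inside each row in one vectorized call, avoiding a
# Python-level function call per row.
sorted_vals = np.sort(data.to_numpy(), axis=1)

# Rebuild a frame with the original column names and drop repeated rows.
deduped = pd.DataFrame(sorted_vals, columns=data.columns).drop_duplicates()

# Variant (not in the original answer): flag duplicates on the sorted
# values, then select from the untouched frame so each kept row retains
# its original column order.
mask = pd.DataFrame(sorted_vals, index=data.index).duplicated()
original_rows = data[~mask]

print(deduped)
print(original_rows)
```

The speed difference in the timings comes from sorting the whole array at once rather than applying a function to every row.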
    
  • 2020-11-30 13:31

    This solution now works:

    data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()
    

    More columns can be added as needed, e.g.:

    data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()
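
For comparison, the row-sorting technique from the other answers also extends to more than two columns. The sketch below uses that technique rather than the stack-based one-liner; the frame `data3`, its column `C` and all of its values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame: rows 0 and 2 hold the same three
# values in a different order.
data3 = pd.DataFrame({
    "A": [1, 4, 3],
    "B": [2, 5, 1],
    "C": [3, 6, 2],
})

# Columns whose order should be ignored when comparing rows.
cols = ["A", "B", "C"]

# Sort the values within each row, then drop rows that became identical.
sorted_vals = np.sort(data3[cols].to_numpy(), axis=1)
deduped3 = pd.DataFrame(sorted_vals, columns=cols).drop_duplicates()

print(deduped3)
```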
    