I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would
You can sort each row of the data frame before dropping the duplicates:
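# result_type='broadcast' keeps the A/B columns; without it, recent pandas versions return a Series of lists
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()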
#    A   B
#0   0  50
#1  10  22
#2  11  35
#3   5  21
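If you prefer the result to be sorted by column A:
# result_type='broadcast' keeps the A/B columns so that sort_values('A') can find column A
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')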
#    A   B
#0   0  50
#3   5  21
#1  10  22
#2  11  35
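For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The sample frame is an assumption (eight rows, where the last four mirror the first four), not the asker's actual data. It passes result_type="broadcast" so that the result stays a DataFrame with the original column names.

```python
import pandas as pd

# Hypothetical sample data: the last four rows are the first four with A and B swapped.
data = pd.DataFrame({"A": [0, 10, 11, 5, 50, 22, 35, 21],
                     "B": [50, 22, 35, 21, 0, 10, 11, 5]})

# Sort the values inside each row so that (50, 0) and (0, 50) become identical.
# result_type="broadcast" writes the sorted values back into the A/B layout.
row_sorted = data.apply(lambda r: sorted(r), axis=1, result_type="broadcast")

# Rows that are now identical collapse to their first occurrence.
deduped = row_sorted.drop_duplicates()
print(deduped)
```

Note that the values in the result are the sorted ones, so a row originally stored as (50, 0) would come out as (0, 50).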
Here is a slightly uglier, but faster solution:
In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21
Timing for an 8K-row DataFrame:
In [50]: big = pd.concat([data] * 10**3, ignore_index=True)
In [51]: big.shape
Out[51]: (8000, 2)
In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop
In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop
In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
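A possible variation on the np.sort idea, in case you want to keep the original (unsorted) rows and their index labels: use the sorted values only to build a duplicate mask. This is a sketch, not part of the timings above; the helper name drop_unordered_duplicates and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_unordered_duplicates(df, cols):
    """Drop rows whose values in `cols` repeat an earlier row, ignoring order."""
    # Sort the selected values within each row on the NumPy side (vectorized).
    sorted_vals = np.sort(df[cols].to_numpy(), axis=1)
    # Flag every row whose sorted values were already seen in an earlier row.
    is_dup = pd.DataFrame(sorted_vals).duplicated().to_numpy()
    # Keep the original, unsorted rows together with their index labels.
    return df.loc[~is_dup]

# Hypothetical sample data: rows 4-7 are rows 0-3 with A and B swapped.
data = pd.DataFrame({"A": [50, 10, 11, 5, 0, 22, 35, 21],
                     "B": [0, 22, 35, 21, 50, 10, 11, 5]})

result = drop_unordered_duplicates(data, ["A", "B"])
print(result)
```

Here row 0 stays as (50, 0) rather than being rewritten to (0, 50), since only the mask is computed from the sorted values.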
Now this solution works:
data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()
More columns can be added as needed, e.g.:
data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()