Remove reverse duplicates from dataframe


I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would


3 Answers

闹比i

2020-11-30 13:16

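You can sort the values within each row of the data frame before dropping the duplicates. On recent pandas versions, pass result_type='broadcast' so that apply returns a DataFrame with the original columns rather than a Series of lists:

data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()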

#   A    B
#0  0   50
#1  10  22
#2  11  35
#3  5   21

If you prefer the result to be sorted by column A:

data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')

#   A    B
#0  0   50
#3  5   21
#1  10  22
#2  11  35
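
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is an assumption reconstructed from the outputs shown above (the question's own data is not reproduced here), and `result_type="broadcast"` is used so that `apply` returns a DataFrame on current pandas versions.

```python
import pandas as pd

# Hypothetical sample data: every pair appears once in each order.
data = pd.DataFrame({
    "A": [0, 10, 11, 5, 50, 22, 35, 21],
    "B": [50, 22, 35, 21, 0, 10, 11, 5],
})

# Sort the values within each row so (50, 0) and (0, 50) become identical,
# keeping the original column labels and index.
row_sorted = data.apply(lambda r: sorted(r), axis=1, result_type="broadcast")

# Drop the now-identical rows, then order the result by column A.
deduped = row_sorted.drop_duplicates().sort_values("A")
print(deduped)
```

Because the row values are sorted, the smaller number of each pair always ends up in column A, whatever the original order was.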


一生所求

2020-11-30 13:21

Here is a slightly uglier, but much faster solution:

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21

Timing on an 8,000-row DataFrame:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
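
Note that sorting the values moves numbers between columns: a row stored as 50, 0 comes back as 0, 50. If the original rows should be kept as they were, one possible variation is to use the sorted array only to build a boolean mask. This is a sketch under assumptions: the sample frame is made up, and `drop_reverse_duplicates` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_reverse_duplicates(df, cols):
    """Keep the first row of each unordered combination of `cols` (hypothetical helper)."""
    # Sort the values within each row of the key columns only.
    keys = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)
    # duplicated() marks later repeats; invert it to keep first occurrences.
    return df[~keys.duplicated()]

# Hypothetical sample data: rows 2 and 3 reverse rows 0 and 1.
data = pd.DataFrame({"A": [50, 10, 0, 22], "B": [0, 22, 50, 10]})
kept = drop_reverse_duplicates(data, ["A", "B"])
print(kept)
```

The mask is aligned on the original index, so the rows that survive are returned exactly as stored, and any other columns in the frame are left untouched.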


长发绾君心

2020-11-30 13:31
This solution also works, provided the frame has at least one column besides A and B to stack:
```
data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()
```
More index columns can be added as needed, for example:
```
data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()
```
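
As a rough illustration of when this works: the chain deduplicates on the stacked values of the non-index columns, not on the A/B pair itself. The sketch below assumes a hypothetical value column `C` whose value is the same for both orderings of a pair and differs between pairs. With other data the result may differ.

```python
import pandas as pd

# Hypothetical sample data: C carries the same value for a pair and its reverse.
data = pd.DataFrame({
    "A": [0, 10, 50, 22],
    "B": [50, 22, 0, 10],
    "C": [1.5, 2.5, 1.5, 2.5],
})

# Move A and B into the index, then stack the remaining column into one Series.
stacked = data.set_index(["A", "B"]).stack()

# Drop repeated values of C, then restore the original column layout.
result = stacked.drop_duplicates().unstack().reset_index()
print(result)
```

If two unrelated pairs happened to share the same value of `C`, one of them would be dropped as well, so the sorting-based approaches are the safer choice when no such column exists.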