Remove duplicates based on the content of two columns not the order

Question

I have a correlation matrix that I melted into a dataframe, so now I have, for example, the following:

First      Second       Value
A          B            0.5
B          A            0.5
A          C            0.2

I want to delete only one of the first two rows, since they hold the same pair in a different order. How can I do that?

Answer 1:

Use:

import numpy as np
import pandas as pd

# to select the columns by name
m = ~pd.DataFrame(np.sort(df[['First','Second']], axis=1)).duplicated()
# to select the columns by position
#m = ~pd.DataFrame(np.sort(df.iloc[:,:2], axis=1)).duplicated()
print (m)

0     True
1    False
2     True
dtype: bool

df = df[m]
print (df)
  First Second  Value
0     A      B    0.5
2     A      C    0.2
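
The mask above is built from a fresh DataFrame with a default 0..n-1 index, so it lines up with df only when df has that same index. A minimal self-contained sketch that passes the original index along is shown here; the helper name drop_unordered_duplicates is hypothetical and not part of pandas, and the sample frame simply reproduces the one from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "First": ["A", "B", "A"],
    "Second": ["B", "A", "C"],
    "Value": [0.5, 0.5, 0.2],
})

def drop_unordered_duplicates(frame, cols):
    """Keep the first row for each unordered combination of values in cols.

    Hypothetical helper for illustration; assumes the values in cols
    can be compared with each other (for example, all strings).
    """
    # Sort the values within each row so that (B, A) becomes (A, B).
    sorted_pairs = np.sort(frame[cols].to_numpy(), axis=1)
    # Reuse the original index so the mask aligns even if it is not 0..n-1.
    mask = ~pd.DataFrame(sorted_pairs, index=frame.index).duplicated()
    return frame[mask]

result = drop_unordered_duplicates(df, ["First", "Second"])
print(result)
```

Passing index=frame.index is the only change from the one-liner; with a default index the two behave the same.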

Answer 2:

You could call duplicated on the two columns after sorting each row with np.sort, then keep the rows that are not flagged:

df = df.loc[~pd.DataFrame(np.sort(df.iloc[:, :2])).duplicated()]
df

  First Second  Value
0     A      B    0.5
2     A      C    0.2

Details

np.sort(df.iloc[:, :2])

array([['A', 'B'],
       ['A', 'B'],
       ['A', 'C']], dtype=object)

~pd.DataFrame(np.sort(df.iloc[:, :2], axis=1)).duplicated()

0     True
1    False
2     True
dtype: bool

Sort the two values within each row, then figure out which rows are duplicates. The resulting mask is used to filter the dataframe via boolean indexing.

To reset the index, use reset_index:

df.reset_index(drop=True)

  First Second  Value
0     A      B    0.5
1     A      C    0.2

Answer 3:

One can also use the following approach:

# create a new column by sorting and joining 'First' and 'Second':
df['newcol'] = df.apply(lambda x: "".join(sorted([x['First'], x['Second']])), axis=1)
print(df)

  First Second  Value newcol
0     A      B    0.5     AB
1     B      A    0.5     AB
2     A      C    0.2     AC

# keep the first row for each key and drop the helper column:
df = df[~df.newcol.duplicated()].iloc[:,:3]
print(df)

  First Second  Value
0     A      B    0.5
2     A      C    0.2
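
A joined string can be ambiguous for multi-character labels (for example 'AB' with 'C' and 'A' with 'BC' both join to 'ABC'). A minimal sketch of a variant that uses a tuple of the sorted pair as the key, without adding a helper column to df, could look like this; the function name unordered_key is hypothetical, and the sketch assumes the two labels in each row can be compared with each other.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "First": ["A", "B", "A"],
    "Second": ["B", "A", "C"],
    "Value": [0.5, 0.5, 0.2],
})

def unordered_key(frame, a, b):
    """Return a Series of order-independent keys for columns a and b.

    Hypothetical helper for illustration: each key is a tuple holding
    the two labels in sorted order, so (B, A) and (A, B) are equal.
    """
    keys = [tuple(sorted(pair)) for pair in zip(frame[a], frame[b])]
    # Keep the original index so the mask aligns with the frame.
    return pd.Series(keys, index=frame.index)

key = unordered_key(df, "First", "Second")
# Keep the first row seen for each key; df itself gets no extra column.
result = df[~key.duplicated()]
print(result)
```

Tuples keep the two labels separate, so pairs that merely concatenate to the same string are not treated as duplicates.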

Source: https://stackoverflow.com/questions/47051854/remove-duplicates-based-on-the-content-of-two-columns-not-the-order

Tags

python

pandas

duplicates