Question
I have a correlation matrix that I melted into a DataFrame, so now I have, for example, the following:
First Second Value
A B 0.5
B A 0.5
A C 0.2
I want to delete only one of the first two rows. How can I do that?
Answer 1:
Use:
import numpy as np
import pandas as pd
# to select the columns by name
m = ~pd.DataFrame(np.sort(df[['First','Second']], axis=1)).duplicated()
# to select the columns by position
#m = ~pd.DataFrame(np.sort(df.iloc[:,:2], axis=1)).duplicated()
print(m)
0 True
1 False
2 True
dtype: bool
df = df[m]
print(df)
First Second Value
0 A B 0.5
2 A C 0.2
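As a self-contained sketch of the same idea, the snippet below rebuilds the question's data and applies the mask. The non-default index (10, 11, 12) and the names pairs, mask and deduped are assumptions added for illustration. Passing index=df.index to the helper DataFrame is a precaution: without it the mask carries a fresh 0..n-1 index, which may not align with df when df has some other index.

```python
import numpy as np
import pandas as pd

# The question's data, with a non-default index (assumed here for illustration)
df = pd.DataFrame(
    {"First": ["A", "B", "A"], "Second": ["B", "A", "C"], "Value": [0.5, 0.5, 0.2]},
    index=[10, 11, 12],
)

# Sort the two labels within each row so that (B, A) and (A, B) compare equal
pairs = np.sort(df[["First", "Second"]].to_numpy(), axis=1)

# Reuse df's index so the boolean mask lines up with df's rows
mask = ~pd.DataFrame(pairs, index=df.index).duplicated()

# Keep only the first occurrence of each unordered pair
deduped = df[mask]
print(deduped)
```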
Answer 2:
You could call duplicated on the row-wise sorted columns (sorted with np.sort) and filter with the negated mask:
df = df.loc[~pd.DataFrame(np.sort(df.iloc[:, :2])).duplicated()]
df
First Second Value
0 A B 0.5
2 A C 0.2
Details
np.sort(df.iloc[:, :2])
array([['A', 'B'],
       ['A', 'B'],
       ['A', 'C']], dtype=object)
~pd.DataFrame(np.sort(df.iloc[:, :2], axis=1)).duplicated()
0 True
1 False
2 True
dtype: bool
Sort the values within each row so that (A, B) and (B, A) become identical, then flag the rows that repeat an earlier one. The negated mask is then used to filter the DataFrame via boolean indexing.
To reset the index, use reset_index:
df.reset_index(drop=True)
First Second Value
0 A B 0.5
1 A C 0.2
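The steps in this answer can be wrapped in a small helper, sketched below. The function name drop_unordered_duplicates and its parameters are hypothetical and not part of pandas. The keep argument is passed straight to DataFrame.duplicated, so keep="last" retains the later row of each pair instead of the earlier one.

```python
import numpy as np
import pandas as pd

def drop_unordered_duplicates(df, cols, keep="first"):
    """Drop rows whose values in `cols` duplicate another row, ignoring order.

    Hypothetical helper: `keep` is forwarded to DataFrame.duplicated and
    decides which row of each duplicate group is retained.
    """
    # Sort the labels within each row so their order no longer matters
    sorted_pairs = np.sort(df[cols].to_numpy(), axis=1)
    # Align the mask with df's own index before filtering
    mask = ~pd.DataFrame(sorted_pairs, index=df.index).duplicated(keep=keep)
    # Renumber the surviving rows from 0
    return df[mask].reset_index(drop=True)

df = pd.DataFrame(
    {"First": ["A", "B", "A"], "Second": ["B", "A", "C"], "Value": [0.5, 0.5, 0.2]}
)

kept_first = drop_unordered_duplicates(df, ["First", "Second"])
kept_last = drop_unordered_duplicates(df, ["First", "Second"], keep="last")
print(kept_first)
print(kept_last)
```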
Answer 3:
One can also use the following approach:
# create a new column holding an order-independent key built from 'First' and 'Second':
df['newcol'] = df.apply(lambda x: "".join(sorted([x['First'], x['Second']])), axis=1)
print(df)
First Second Value newcol
0 A B 0.5 AB
1 B A 0.5 AB
2 A C 0.2 AC
# get its non-duplicated indexes and remove the new column:
df = df[~df.newcol.duplicated()].iloc[:,:3]
print(df)
First Second Value
0 A B 0.5
2 A C 0.2
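A key built by concatenating the labels works for single-character labels like those in the question, but it can collide once labels are longer: "AB" with "C" and "A" with "BC" both join to "ABC". A possible variant, sketched below under the assumption of multi-character labels (the sample data here is made up for illustration), keeps the sorted pair as a tuple so the boundary between the two labels is preserved.

```python
import pandas as pd

# Made-up data with multi-character labels (not from the question)
df = pd.DataFrame(
    {"First": ["AB", "C", "A"], "Second": ["C", "AB", "BC"], "Value": [0.5, 0.5, 0.2]}
)

# A concatenated key cannot tell these two pairs apart
collides = "".join(sorted(["AB", "C"])) == "".join(sorted(["A", "BC"]))

# Build one hashable, order-independent key per row
keys = [tuple(sorted(pair)) for pair in zip(df["First"], df["Second"])]

# Flag repeats of an earlier key; the mask shares df's index
mask = ~pd.Series(keys, index=df.index).duplicated()

result = df[mask]
print(result)
```

This also avoids adding and then removing a temporary column.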
Source: https://stackoverflow.com/questions/47051854/remove-duplicates-based-on-the-content-of-two-columns-not-the-order