Problem
Consider the following dataframe:
import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
I want to obtain a new column that is True for rows whose letter also appears in a different ID group, while repeated occurrences of a letter within the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in the first group (row 0) should be True, because A also appears in another group (row 7). However, any further occurrence of A within the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False, because A is no longer present in any other group.
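For reference, this is what the attempted call returns on the df_so defined above (the inline comments showing the output are my own annotation):

print(df_so.duplicated(subset=['letter'], keep=False))
# 0     True
# 1    False
# 2     True   <- A repeated inside group 100; this should be False
# 3    False
# 4     True
# 5    False
# 6     True
# 7     True
# dtype: bool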
Answer 1:
What you need is essentially the AND of two different duplicated() calls:
~df_so.duplicated() deals with duplicates within groups.
df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True) deals with duplicates between groups, ignoring duplicates within the current group.
Code:
import pandas as pd
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True)
print(df_so)
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
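To see why the AND produces this result, here is a minimal sketch (reusing the df_so defined above, computed on the original two columns so the added 'dup' column does not interfere) that prints the two masks separately. Index 2 is missing from the second mask because drop_duplicates() removed that row; when & aligns the two Series, the missing position ends up False, which is what we want.

cols = df_so[['ID', 'letter']]            # the original two columns, before 'dup' was added
within = ~cols.duplicated()               # False only at index 2, the exact (ID, letter) repeat
between = cols.drop_duplicates().duplicated(subset='letter', keep=False)
print(within)   # True everywhere except index 2
print(between)  # True at 0, 4, 6, 7; index 2 absent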
Other case (with row 7, the last A, removed):
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D'],
}
Output:
ID letter dup
0 100 A False
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
Answer 2:
As you clarified in the comment, you need an additional mask besides the current duplicated call:
m1 = df_so.duplicated(subset=['letter'], keep=False)               # letter appears more than once anywhere
m2 = ~df_so.groupby('ID').letter.apply(lambda x: x.duplicated())   # not a repeat within its own ID group
df_so['dup'] = m1 & m2
Out[157]:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
8 300 A False
Note: I added row 8 (another 300, A) as in the comment.
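For completeness, a minimal sketch (my own reconstruction, not part of the original answer) of the 9-row frame used here, with an extra (300, 'A') row appended. On newer pandas versions the apply result may carry the group key in its index, so I pass group_keys=False to keep the original index; whether you need it depends on your pandas version.

import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A', 'A'],
}
df_so = pd.DataFrame(data_so)

m1 = df_so.duplicated(subset=['letter'], keep=False)
# group_keys=False keeps the original index so m1 and m2 line up
m2 = ~df_so.groupby('ID', group_keys=False).letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
print(df_so)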
Answer 3:
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter columns, then simply select the first row of each group.
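For comparison, a rough pandas equivalent of this datatable grouping (my own addition, not from the answer): drop_duplicates on both columns keeps the first row of each (ID, letter) group, though the row order may differ from datatable's sorted-by-group output, and note that both return the deduplicated rows rather than the boolean 'dup' column asked for in the question.

# pandas sketch: keep the first row per (ID, letter) group
first_per_group = df_so.drop_duplicates(subset=['ID', 'letter'], keep='first')
print(first_per_group)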
Source: https://stackoverflow.com/questions/64128529/find-duplicate-rows-among-different-groups-with-pandas