Problem
Consider the following dataframe:
import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
I want to obtain a new column that is True for rows whose letter also appears in a different ID group, while repeated occurrences of a letter within the same group should be False.
What I've tried
I've tried using
df_so['dup'] = df_so.duplicated(subset=['letter'], keep=False)
but the result is not what I want:
The first occurrence of A in the first group (row 0) should be True, because A also appears in another group (row 7). However, any further occurrence of A within the same group (row 2) should be False.
If row 7 is deleted, then row 0 should be False, because A is no longer present in any other group.
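For reference, this is what the attempted call returns on the df_so defined above (the inline comments showing the output are my own annotation):

print(df_so.duplicated(subset=['letter'], keep=False))
# 0     True
# 1    False
# 2     True   <- A repeated inside group 100; this should be False
# 3    False
# 4     True
# 5    False
# 6     True
# 7     True
# dtype: bool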
Answer 1:
What you need is essentially the AND of two different duplicated() calls:
~df_so.duplicated() deals with duplicates within groups.
df_so.drop_duplicates().duplicated(subset='letter', keep=False).fillna(True) deals with duplicates between groups, ignoring duplicates within the current group.
Code:
import pandas as pd
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A'],
}
df_so = pd.DataFrame(data_so, columns=['ID', 'letter'])
df_so['dup'] = ~df_so.duplicated() & df_so.drop_duplicates().duplicated(subset='letter',keep=False).fillna(True)
print(df_so)
Output:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
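To see why the AND produces this result, here is a minimal sketch (reusing the df_so defined above, computed on the original two columns so the added 'dup' column does not interfere) that prints the two masks separately. Index 2 is missing from the second mask because drop_duplicates() removed that row; when & aligns the two Series, the missing position ends up False, which is what we want.

cols = df_so[['ID', 'letter']]            # the original two columns, before 'dup' was added
within = ~cols.duplicated()               # False only at index 2, the exact (ID, letter) repeat
between = cols.drop_duplicates().duplicated(subset='letter', keep=False)
print(within)   # True everywhere except index 2
print(between)  # True at 0, 4, 6, 7; index 2 absent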
Other case (with row 7, the last A, removed):
data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D'],
}
Output:
ID letter dup
0 100 A False
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
Answer 2:
As you clarified in the comment, you need an additional mask besides the current duplicated call:
m1 = df_so.duplicated(subset=['letter'], keep=False)               # letter appears more than once anywhere
m2 = ~df_so.groupby('ID').letter.apply(lambda x: x.duplicated())   # not a repeat within its own ID group
df_so['dup'] = m1 & m2
Out[157]:
ID letter dup
0 100 A True
1 100 B False
2 100 A False
3 200 C False
4 200 D True
5 300 E False
6 300 D True
7 300 A True
8 300 A False
Note: I added row 8 (another 300, A) as in the comment.
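For completeness, a minimal sketch (my own reconstruction, not part of the original answer) of the 9-row frame used here, with an extra (300, 'A') row appended. On newer pandas versions the apply result may carry the group key in its index, so I pass group_keys=False to keep the original index; whether you need it depends on your pandas version.

import pandas as pd

data_so = {
    'ID': [100, 100, 100, 200, 200, 300, 300, 300, 300],
    'letter': ['A', 'B', 'A', 'C', 'D', 'E', 'D', 'A', 'A'],
}
df_so = pd.DataFrame(data_so)

m1 = df_so.duplicated(subset=['letter'], keep=False)
# group_keys=False keeps the original index so m1 and m2 line up
m2 = ~df_so.groupby('ID', group_keys=False).letter.apply(lambda x: x.duplicated())
df_so['dup'] = m1 & m2
print(df_so)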
Answer 3:
My idea for this problem:
import datatable as dt
df = dt.Frame(df_so)
df[:1, :, dt.by("ID", "letter")]
I would group by both the ID and letter columns, then simply select the first row of each group.
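For comparison, a rough pandas equivalent of this datatable grouping (my own addition, not from the answer): drop_duplicates on both columns keeps the first row of each (ID, letter) group, though the row order may differ from datatable's sorted-by-group output, and note that both return the deduplicated rows rather than the boolean 'dup' column asked for in the question.

# pandas sketch: keep the first row per (ID, letter) group
first_per_group = df_so.drop_duplicates(subset=['ID', 'letter'], keep='first')
print(first_per_group)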
Source: https://stackoverflow.com/questions/64128529/find-duplicate-rows-among-different-groups-with-pandas