Pandas: How to filter for items that occur more than once in a dataframe

庸人自扰 2021-02-13 15:27

I have a Pandas DataFrame that contains duplicate entries. Some items are also listed twice or three times. I would like to filter it so that it only shows items that are listed

1 answer
  •  谎友^ (original poster)
    2021-02-13 15:57

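    You can use value_counts to count how often each item occurs, build a boolean mask from those counts, take the index of the entries that pass it, and test each row's membership in that index with isin. The example below keeps items that occur more than twice: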

    In [3]:
    import pandas as pd
    df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
    df
    
    Out[3]:
        a
    0   0
    1   0
    2   0
    3   1
    4   2
    5   2
    6   3
    7   3
    8   3
    9   3
    10  3
    11  3
    12  4
    13  4
    14  4
    
    In [8]:
    df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
    
    Out[8]:
        a
    0   0
    1   0
    2   0
    6   3
    7   3
    8   3
    9   3
    10  3
    11  3
    12  4
    13  4
    14  4
    

    So breaking the above down:

    In [9]:
    df['a'].value_counts() > 2
    
    Out[9]:
    3     True
    4     True
    0     True
    2    False
    1    False
    Name: a, dtype: bool
    
    In [10]:
    # apply the boolean mask to keep only the counts greater than 2
    df['a'].value_counts()[df['a'].value_counts()>2]
    
    Out[10]:
    3    6
    4    3
    0    3
    Name: a, dtype: int64
    
    In [11]:
    # we're interested in the index here, pass this to isin
    df['a'].value_counts()[df['a'].value_counts()>2].index
    
    Out[11]:
    Int64Index([3, 4, 0], dtype='int64')
    
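The breakdown above can also be written as a short script that computes the counts only once instead of calling value_counts twice. This is a minimal sketch under the same assumptions as the example (column 'a', threshold 2); the variable names counts, frequent and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Count each value once and reuse the result.
counts = df['a'].value_counts()

# Values whose count exceeds the threshold (2, as in the answer above).
frequent = counts[counts > 2].index

# Keep only the rows whose value is one of the frequent values.
result = df[df['a'].isin(frequent)]
print(result)
```

Changing the 2 to another threshold adjusts how many occurrences an item needs before its rows are kept.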

    EDIT

    As user @JonClements suggested, a simpler and faster method is to group by the column of interest and filter the groups by size:

    In [4]:
    df.groupby('a').filter(lambda x: len(x) > 2)
    
    Out[4]:
        a
    0   0
    1   0
    2   0
    6   3
    7   3
    8   3
    9   3
    10  3
    11  3
    12  4
    13  4
    14  4
    
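A related variant, not taken from the original answer, is to attach each row's group size with transform('size') and compare it against the threshold. This sketch avoids calling a Python lambda once per group, which may matter when there are many distinct values; treat it as an alternative to try, not as a measured improvement.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# Size of the group each row belongs to, aligned with df's index.
group_sizes = df.groupby('a')['a'].transform('size')

# Keep rows whose value occurs more than twice.
result = df[group_sizes > 2]

# For comparison: the groupby/filter form from the answer above.
expected = df.groupby('a').filter(lambda x: len(x) > 2)
print(result.equals(expected))
```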

    EDIT 2

    To get just a single entry for each repeated item, call drop_duplicates and pass the parameter subset='a':

    In [2]:
    df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
    
    Out[2]:
        a
    0   0
    6   3
    12  4
    
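If only the repeated values and their counts are needed, and not the original rows, the filtered value_counts result may already be enough. This is a sketch and not part of the original answer; the name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4]})

# One entry per distinct value, holding its number of occurrences.
counts = df['a'].value_counts()

# Keep values that occur more than twice, ordered by value.
summary = counts[counts > 2].sort_index()
print(summary)
```

Unlike drop_duplicates, this discards the original row labels (0, 6, 12) and indexes the result by the values themselves.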
