Python Pandas: remove entries based on the number of occurrences

走了就别回头了 2020-12-05 07:36

I'm trying to remove entries from a data frame that occur fewer than 100 times. The data frame, data, looks like this:

pid   tag
1     23    
1          


        
4 answers
  • 2020-12-05 07:49

    Here are run times for two of the solutions posted here, plus one that was not posted (using value_counts()), which is much faster than the others:

    Create the data:

    import pandas as pd
    import numpy as np
    
    # Generate some 'users'
    np.random.seed(42)
    df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})
    
    # Show that some users occur only once
    print("{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1)))
    

    Output:

    171 users only occur once in dataset

    Time a few different ways of removing users with only one entry. These were run in separate cells in a Jupyter Notebook:

    %%timeit
    df.groupby(by='uid').filter(lambda x: len(x) > 1)
    
    %%timeit
    df[df.groupby('uid').uid.transform(len) > 1]
    
    %%timeit
    vc = df.uid.value_counts()
    df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()
    

    These gave the following outputs:

    10 loops, best of 3: 46.2 ms per loop
    10 loops, best of 3: 30.1 ms per loop
    1000 loops, best of 3: 1.27 ms per loop
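
    Applied to the original question, the value_counts() approach might look like the sketch below. The helper name drop_rare and its min_count parameter are made up for illustration and are not part of pandas. The sample data is the small pid/tag frame from another answer, with a threshold of 2 standing in for the 100 in the question.

```python
import pandas as pd


def drop_rare(df, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times.

    Hypothetical helper for illustration; not a pandas function.
    """
    counts = df[column].value_counts()            # occurrences per distinct value
    frequent = counts[counts >= min_count].index  # values that meet the threshold
    return df[df[column].isin(frequent)]


data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

# Threshold of 2 is an assumption to suit the tiny sample; the question asks for 100.
result = drop_rare(data, 'tag', min_count=2)
print(result)
```

    Because isin() builds a boolean mask over the original frame, the surviving rows keep their original order and index labels.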
    
  • 2020-12-05 07:53

    Edit: Thanks to @WesMcKinney for showing this much more direct way:

    data[data.groupby('tag').pid.transform(len) > 1]
    

    import pandas

    data = pandas.DataFrame(
        {'pid' : [1,1,1,2,2,3,3,3],
         'tag' : [23,45,62,24,45,34,25,62],
         })
    
    # Count the rows for each tag (count() also counts rows where pid is 0,
    # which np.count_nonzero would skip), then keep tags that occur at least twice
    bytag = data.groupby('tag').count()
    tags = bytag[bytag.pid >= 2].index
    print(data[data['tag'].isin(tags)])
    

    yields

       pid  tag
    1    1   45
    2    1   62
    4    2   45
    7    3   62
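
    To see why the transform one-liner works, here is a minimal sketch on the same sample data. It uses transform('size') rather than transform(len), which is a substitution on my part; the variable names tag_counts and kept are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

# transform returns one value per original row: here, the size of that row's tag group.
tag_counts = data.groupby('tag')['pid'].transform('size')
print(tag_counts.tolist())

# The result is aligned with data's index, so it can be used directly as a row mask.
kept = data[tag_counts > 1]
print(kept)
```

    The key point is that transform broadcasts the per-group count back onto every row, so no separate isin() lookup is needed.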
    
  • 2020-12-05 08:10

    As of pandas 0.12, groupby objects have a filter method, which allows this kind of operation:

    In [11]: g = data.groupby('tag')
    
    In [12]: g.filter(lambda x: len(x) > 1)  # pandas 0.13.1
    Out[12]:
       pid  tag
    1    1   45
    2    1   62
    4    2   45
    7    3   62
    

    The function (the first argument of filter) is applied to each group (a sub-DataFrame), and the result contains the rows of the original DataFrame that belong to groups for which the function returned True.
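
    As a rough sketch of that behaviour, the lambda can be written out as a named function. The names is_frequent and MIN_COUNT are hypothetical, and the threshold of 2 is chosen to suit the small sample rather than the 100 from the question.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

MIN_COUNT = 2  # illustrative threshold (assumption); the question uses 100


def is_frequent(group):
    # `group` is the sub-DataFrame holding every row that shares one tag value.
    # Returning True keeps all of the group's rows; False drops them all.
    return len(group) >= MIN_COUNT


kept = data.groupby('tag').filter(is_frequent)
print(kept)
```

    Since the function is called once per group in Python, this form tends to be slower than vectorised alternatives when there are many distinct groups.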

    Note: in 0.12 the row order of the result differs from that of the original DataFrame. This was fixed in 0.13+:

    In [21]: g.filter(lambda x: len(x) > 1)  # pandas 0.12
    Out[21]: 
       pid  tag
    1    1   45
    4    2   45
    2    1   62
    7    3   62
    
  • 2020-12-05 08:12
    import pandas as pd

    df = pd.DataFrame([(1, 2), (1, 3), (1, 4), (2, 1), (2, 2)], columns=['col1', 'col2'])
    
    In [36]: df
    Out[36]: 
       col1  col2
    0     1     2
    1     1     3
    2     1     4
    3     2     1
    4     2     2
    
    # Count the rows in each col1 group
    gp = df.groupby('col1').count()
    
    In [38]: gp
    Out[38]: 
          col2
    col1      
    1        3
    2        2
    

    Now keep the rows whose col1 value occurs more than twice:

    tf = gp[gp.col2 > 2].reset_index()
    df[df.col1.isin(tf.col1)]  # isin avoids comparing two Series of different lengths
    
    Out[41]: 
       col1  col2
    0     1     2
    1     1     3
    2     1     4
    