I'm trying to remove entries from a data frame which occur less than 100 times.
The data frame data looks like this:

pid   tag
1     23
1     45
1     62
2     24
2     45
3     34
3     25
3     62
Here are some run times for a couple of the solutions posted here, along with one that was not posted (using value_counts()), which is much faster than the other solutions:
import pandas as pd
import numpy as np
# Generate some 'users'
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})
# Show that some uids occur only once
print("{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1)))
171 users only occur once in dataset
%%timeit
df.groupby(by='uid').filter(lambda x: len(x) > 1)
10 loops, best of 3: 46.2 ms per loop

%%timeit
df[df.groupby('uid').uid.transform(len) > 1]
10 loops, best of 3: 30.1 ms per loop

%%timeit
vc = df.uid.value_counts()
df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()
1000 loops, best of 3: 1.27 ms per loop
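Applied to the original question's frame (columns pid and tag, with a threshold of 100), the same value_counts()/isin() pattern would look roughly like this; a sketch, assuming data is the frame from the question:

# Keep only rows whose tag occurs at least 100 times (sketch).
vc = data.tag.value_counts()
data[data.tag.isin(vc[vc >= 100].index)]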
Edit: Thanks to @WesMcKinney for showing this much more direct way:
data[data.groupby('tag').pid.transform(len) > 1]
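To see why this works: transform(len) broadcasts each tag's group size back onto the original rows, so the comparison yields a per-row boolean mask (newer pandas also accepts transform('size')). A minimal, self-contained sketch on the sample data used in this answer:

import pandas as pd
data = pd.DataFrame({'pid': [1, 1, 1, 2, 2, 3, 3, 3],
                     'tag': [23, 45, 62, 24, 45, 34, 25, 62]})
# Each row gets the size of its tag's group:
sizes = data.groupby('tag').pid.transform(len)
print(sizes.tolist())   # [1, 2, 2, 1, 2, 1, 1, 2] -> tags 45 and 62 occur twice
print(data[sizes > 1])  # same rows as the one-liner above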
import pandas
import numpy as np
data = pandas.DataFrame(
    {'pid': [1, 1, 1, 2, 2, 3, 3, 3],
     'tag': [23, 45, 62, 24, 45, 34, 25, 62]})
bytag = data.groupby('tag').aggregate(np.count_nonzero)
tags = bytag[bytag.pid >= 2].index
print(data[data['tag'].isin(tags)])
yields
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
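The same three steps (count rows per tag, pick the tags that meet the threshold, keep the matching rows) can also be written with groupby(...).size(), which counts rows per group directly; a sketch using the same data frame:

counts = data.groupby('tag').size()     # rows per tag
keep = counts[counts >= 2].index        # tags that occur at least twice
print(data[data['tag'].isin(keep)])     # same four rows as above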
New in 0.12, groupby objects have a filter method, allowing you to do these types of operations:
In [11]: g = data.groupby('tag')
In [12]: g.filter(lambda x: len(x) > 1) # pandas 0.13.1
Out[12]:
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
The function (the first argument of filter) is applied to each group (subframe), and the results include elements of the original DataFrame belonging to groups which evaluated to True.
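Since the lambda receives each group as a sub-DataFrame, any function that returns a boolean works as the predicate; for example, a sketch that keeps tags shared by more than one pid (which happens to select the same rows here):

g.filter(lambda x: x['pid'].nunique() > 1)   # tags 45 and 62 are used by two pids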
Note: in 0.12 the ordering differs from that of the original DataFrame; this was fixed in 0.13+:
In [21]: g.filter(lambda x: len(x) > 1) # pandas 0.12
Out[21]:
pid tag
1 1 45
4 2 45
2 1 62
7 3 62
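For the original question's threshold, the same pattern would be, as a sketch (assuming the real frame is large enough for some tags to reach 100 occurrences):

data.groupby('tag').filter(lambda x: len(x) >= 100)   # keep tags occurring at least 100 times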
df = pd.DataFrame([(1, 2), (1, 3), (1, 4), (2, 1), (2, 2)], columns=['col1', 'col2'])
In [36]: df
Out[36]:
col1 col2
0 1 2
1 1 3
2 1 4
3 2 1
4 2 2
gp = df.groupby('col1').aggregate(np.count_nonzero)
In [38]: gp
Out[38]:
col2
col1
1 3
2 2
Let's get the groups where the count is > 2, then keep the matching rows, using isin so the comparison does not depend on index alignment:
tf = gp[gp.col2 > 2].reset_index()
df[df.col1.isin(tf.col1)]
Out[41]:
col1 col2
0 1 2
1 1 3
2 1 4