Pandas get topmost n records within each group

前端未结

关注

 3  1609

Suppose I have pandas DataFrame like this:

>>> df = pd.DataFrame({\'id\':[1,1,1,2,2,2,2,3,4],\'value\':[1,2,3,1,2,3,4,1,1]})
>>> df
   id


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  醉酒成梦        
                
              
                            
                2020-11-22 06:46
              
            
            
                                                                       
Did you try df.groupby('id').head(2)

Ouput generated: 

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1


(Keep in mind that you might need to order/sort before, depending on your data)

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the multindex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
    id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  隐瞒了意图╮        
                
              
                            
                2020-11-22 06:47
              
            
            
                                                                       
Sometimes sorting the whole data ahead is very time consuming. 
We can groupby first and doing topk for each group:

g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  眼角桃花        
                
              
                            
                2020-11-22 06:49
              
            
            
                                                                       
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64


There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.

(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复