How does the pandas groupby method actually work?


So I was trying to understand the pandas.DataFrame.groupby() method and I came across this example in the documentation:

    In [1]: df = pd.DataFrame({'A' :


                      
1 Answer

        
                         				            
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2020-11-29 13:01
              
            
            
                                                                       
When you use just

df.groupby('A')


You get a GroupBy object.  You haven't applied any function to it at that point.  Under the hood, while this definition might not be perfect, you can think of a groupby object as:


An iterator of (group, DataFrame) pairs, for DataFrames, or
An iterator of (group, Series) pairs, for Series.


To illustrate:

import pandas as pd

df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, 2, 3, 4]})
grouped = df.groupby('A')

# each `i` is a tuple of (group, DataFrame)
# so your output here will be a little messy
for i in grouped:
    print(i)
(1,    A  B
0  1  1
1  1  2)
(2,    A  B
2  2  3
3  2  4)

# this version unpacks each tuple into two loop
# variables.  each `group` is a group key, each
# `group_df` is its corresponding DataFrame
# (named `group_df` so it doesn't overwrite `df`)
for group, group_df in grouped:
    print('group of A:', group, '\n')
    print(group_df, '\n')
group of A: 1 

   A  B
0  1  1
1  1  2 

group of A: 2 

   A  B
2  2  3
3  2  4 

# and if you just wanted to visualize the groups,
# the second loop variable is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')
group of A: 1 

group of A: 2 
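
If you'd rather inspect the groups without writing a loop, here is a minimal sketch using the same toy data. The variable names (`group_labels`, `first_group`, `series_pairs`) are just illustrative and not from the question; the exact printed form of the keys may vary with your pandas/NumPy versions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A')

# Nothing has been aggregated yet; this is just a GroupBy object.
print(type(grouped).__name__)  # DataFrameGroupBy

# Map each group key to the row labels that belong to that group.
group_labels = {key: labels.tolist() for key, labels in grouped.groups.items()}
print(group_labels)

# Pull out a single group's DataFrame directly by its key.
first_group = grouped.get_group(1)
print(first_group)

# Selecting one column gives (group, Series) pairs instead of
# (group, DataFrame) pairs.
series_pairs = [(key, type(values).__name__) for key, values in df.groupby('A')['B']]
print(series_pairs)
```

This is only meant to back up the "iterator of pairs" mental model above, not to describe how pandas stores groups internally.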


Now as for .head.  Just have a look at the docs for that method:


  Essentially equivalent to .apply(lambda x: x.head(n))


So here you're actually applying a function to each group of the groupby object.  Keep in mind that .head(5) is applied to each group (each DataFrame), so because every group has five or fewer rows, you get your original DataFrame back.
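
To make that concrete, here is a small sketch on the toy DataFrame from earlier (not the one in the question). It assumes a reasonably recent pandas. The comparison with .apply is done on column B only, and with group_keys=False, so that both results keep the original row labels; those choices are mine, not something the docs prescribe.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

# Each group has only 2 rows, so head(5) keeps every row
# and the result matches the original DataFrame.
same_as_original = df.groupby('A').head(5).equals(df)
print(same_as_original)

# Compare head(1) with the .apply(lambda x: x.head(1)) form on column B.
# group_keys=False stops apply() from adding the group key to the index.
via_head = df.groupby('A')['B'].head(1)
via_apply = df.groupby('A', group_keys=False)['B'].apply(lambda s: s.head(1))
print(via_head.tolist(), via_apply.tolist())
```

Both forms should pick the first row of each group; .head is simply the more direct way to write it.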

Consider this with the example above.  If you use .head(1), you get only the first row of each group:

print(df.groupby('A').head(1))
   A  B
0  1  1
2  2  3

                                                                        
                                                        
            
            
              
                