Expressing pandas subset using pipe

后端未结


I have a dataframe that I subset like this:

   a  b   x  y
0  1  2   3 -1
1  2  4   6 -2
2  3  6   6 -3
3  4  8   3 -4

df = df[(df.a >= 2) & (df.b <=


                      
3 Answers

无人共我, 2021-02-15 06:26

As long as a step takes a DataFrame (possibly with more arguments) and returns a DataFrame, you can use pipe. Whether there is an advantage to doing so is another question.

Here, e.g., you can use

df\
    .pipe(lambda df_, x, y: df_[(df_.a >= x) & (df_.b <= y)], 2, 8)\
    .pipe(lambda df_: df_.groupby(df_.x))\
    .mean()


Notice how the first stage is a lambda that takes 3 arguments, with the 2 and 8 passed as parameters. That's not the only way to do so - it is equivalent to 

    .pipe(lambda df_: df_[(df_.a >= 2) & (df_.b <= 8)])\
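As a rough sketch of the equivalence described above, the filter can also be written as a named function. The sample data is taken from the question; the helper name `filter_ab` and its parameter names `a_min` and `b_max` are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})


def filter_ab(df_, a_min, b_max):
    """Hypothetical helper: keep rows where a >= a_min and b <= b_max."""
    return df_[(df_.a >= a_min) & (df_.b <= b_max)]


# Extra arguments to pipe are forwarded to the function after the DataFrame.
positional = df.pipe(filter_ab, 2, 8).groupby("x").mean()
keyword = df.pipe(filter_ab, a_min=2, b_max=8).groupby("x").mean()

# Same filter with the thresholds hard-coded in the lambda.
hard_coded = (
    df.pipe(lambda df_: df_[(df_.a >= 2) & (df_.b <= 8)])
    .groupby("x")
    .mean()
)
```

All three variants should produce the same grouped means, so the choice between them is mostly about readability.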


Also note that you can use

df\
    .pipe(lambda df_, x, y: df[(df.a >= x) & (df.b <= y)], 2, 8)\
    .groupby('x')\
    .mean()


Here the lambda takes df_, but operates on df, and the second pipe has been replaced with a groupby.


The first change works here, but is fragile. It happens to work since this is the first pipe stage. If it were a later stage, it might take a DataFrame with one dimension, and attempt to filter it on a mask with another dimension, for example.
The second change is fine. In fact, I think it is more readable. Basically, anything that takes a DataFrame and returns one can either be called directly or through pipe.
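To illustrate one way this fragility can surface, here is a small sketch using the question's sample data. The extra filter on `x` is a hypothetical earlier stage added only for the demonstration. When a later stage refers to the outer `df` instead of its own argument, the work of the earlier stage is silently discarded.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})

# Robust: each stage only touches the frame it receives (df_).
robust = (
    df.pipe(lambda df_: df_[df_.x == 6])
    .pipe(lambda df_: df_[df_.a >= 3])
)

# Fragile: the second stage ignores df_ and filters the outer df,
# so the x == 6 filter from the first stage is lost.
fragile = (
    df.pipe(lambda df_: df_[df_.x == 6])
    .pipe(lambda df_: df[df.a >= 3])
)

print(robust.index.tolist())   # rows surviving both filters
print(fragile.index.tolist())  # rows surviving only the second filter
```

Mixing the two (indexing `df_` with a mask built from `df`) is a related hazard, since the mask and the frame may no longer share the same index.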

                                                                        
                                                        
            
            
              
                
北荒, 2021-02-15 06:38

You can try, but I think it is more complicated:

print(df[(df.a >= 2) & (df.b <= 8)].groupby(df.x).mean())
     a  b  x    y
x                
3  4.0  8  3 -4.0
6  2.5  5  6 -2.5


def masker(df, mask):
    return df[mask]

mask1 = (df.a >= 2)
mask2 = (df.b <= 8)     

print(df.pipe(masker, mask1).pipe(masker, mask2).groupby(df.x).mean())
     a  b  x    y
x                
3  4.0  8  3 -4.0
6  2.5  5  6 -2.5
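In the masker version above, the masks are computed on the full frame and the second one is applied to an already filtered frame, so pandas has to align it by index and may warn about reindexing a boolean Series. One possible variation, sketched here under the assumption that the masks can be expressed as callables, builds each mask from the frame the stage actually receives. The name `make_mask` is hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})


def masker(df_, make_mask):
    """Filter df_ with a mask computed from df_ itself."""
    return df_[make_mask(df_)]


result = (
    df.pipe(masker, lambda d: d.a >= 2)
    .pipe(masker, lambda d: d.b <= 8)
    .groupby("x")
    .mean()
)
print(result)
```

Grouping by the column name `'x'` rather than the Series `df.x` also keeps the pipeline independent of the outer frame.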

                                                                        
                                                        
            
            
              
                
傲寒, 2021-02-15 06:41

I believe this method is clear with regard to your filtering steps and subsequent operations.  Using loc[(mask1) & (mask2)] is probably more performant, however.

>>> (df
     .pipe(lambda x: x.loc[x.a >= 2])
     .pipe(lambda x: x.loc[x.b <= 8])
     .pipe(pd.DataFrame.groupby, 'x')
     .mean()
     )

     a  b    y
x             
3  4.0  8 -4.0
6  2.5  5 -2.5


Alternatively:

(df
 .pipe(lambda x: x.loc[x.a >= 2])
 .pipe(lambda x: x.loc[x.b <= 8])
 .groupby('x')
 .mean()
 )
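For comparison, the single-mask form mentioned at the start of this answer might look like the sketch below, again using the question's sample data. It builds one combined boolean mask and indexes once instead of filtering in two stages. No timing is measured here.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})

# Two chained filtering stages, as in the pipe version.
chained = (
    df.pipe(lambda x: x.loc[x.a >= 2])
    .pipe(lambda x: x.loc[x.b <= 8])
)

# One combined mask applied in a single .loc call.
combined = df.loc[(df.a >= 2) & (df.b <= 8)]

# Both selections should contain the same rows.
print(chained.equals(combined))

grouped = combined.groupby("x").mean()
print(grouped)
```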

                                                                        
                                                        
            
            
              
                