Pandas: Selecting rows for which groupby.sum() satisfies condition


In pandas I have a dataframe of the form:

>>> import pandas as pd
>>> df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,


                      
3 Answers

庸人自扰 (2021-01-19 05:53)

Use groupby and filter

df.groupby('ID').filter(lambda s: s.x.sum()>=2)


Output:

   ID  x
3  24  0
4  24  1
5  24  1
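
To see why only ID 24 survives, it can help to look at the per-group sums that the filter lambda is effectively testing. The following is a minimal sketch, assuming the seven-row sample frame used in the other answers; the names `group_sums` and `kept` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# Sum of x within each ID; sort=False keeps groups in first-seen order.
group_sums = df.groupby('ID', sort=False)['x'].sum()

# filter() calls the lambda once per group and keeps every row of the
# groups for which it returns True, preserving the original index.
kept = df.groupby('ID').filter(lambda g: g['x'].sum() >= 2)

print(group_sums.to_dict())
print(kept)
```

For the sample data the per-group sums are 1 for ID 51, 2 for ID 24 and 0 for ID 31, which is why only the three ID 24 rows are returned.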

                                                                        
                                                        
            
            
              
                
广开言路 (2021-01-19 06:04)

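import pandas as pd

df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,1,0]})
# Broadcast each ID's sum of x onto its rows and keep rows whose sum is >= 2 (transform('sum') also works).
df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2,:]

Output: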
   ID  x
3  24  0
4  24  1
5  24  1
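
Unlike filter, transform returns a result aligned to the original rows, so it can be used directly as a boolean mask. Below is a small sketch of the intermediate values, assuming the same sample frame; `row_sums` and `mask` are illustrative names, and the string alias 'sum' is used for the aggregation.

```python
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# transform('sum') computes one sum per ID and repeats it on every row
# of that ID, so the result has the same index as df.
row_sums = df.groupby('ID')['x'].transform('sum')

# Comparing against the threshold gives a row-aligned boolean mask.
mask = row_sums >= 2

result = df.loc[mask]
print(row_sums.tolist())
print(result)
```

Because the mask is computed with vectorized operations rather than a Python lambda per group, this form tends to scale better than filter, which is consistent with the timings reported further down.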

                                                                        
                                                        
            
            
              
                
不要未来只要你来 (2021-01-19 06:04)

Using np.bincount and pd.factorize

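An alternative, more advanced technique that gives better performance:

import numpy as np

# Integer-code each ID, sum x per code with bincount, then map the sums back onto the rows.
f, u = df.ID.factorize()
df[np.bincount(f, df.x.values)[f] >= 2]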

   ID  x
3  24  0
4  24  1
5  24  1
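
The intermediate arrays make the trick easier to follow: factorize assigns each distinct ID an integer code in order of first appearance, np.bincount with weights adds up x per code, and indexing the result with the codes spreads each group total back over the rows. This is a sketch assuming the same sample frame; `codes`, `uniques` and `totals` are illustrative names standing in for `f`, `u` and the bincount result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# codes[i] is the integer label of the i-th ID; uniques holds the
# distinct IDs in order of first appearance.
codes, uniques = df['ID'].factorize()

# With weights, bincount sums the weights that fall in each bin,
# i.e. the sum of x for each code (returned as floats).
totals = np.bincount(codes, weights=df['x'].to_numpy())

# totals[codes] broadcasts each group's total back onto its rows.
mask = totals[codes] >= 2
result = df[mask]

print(codes, list(uniques), totals)
print(result)
```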




In obnoxious one-liner form

df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]

   ID  x
3  24  0
4  24  1
5  24  1




np.bincount and np.unique

np.unique with the return_inverse parameter accomplishes exactly the same thing. However, np.unique sorts the array, which changes the time complexity of the solution from linear to O(n log n).

u, f = np.unique(df.ID.values, return_inverse=True)
df[np.bincount(f, df.x.values)[f] >= 2]
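
For comparison, here is a sketch of how the labels come out with np.unique, again assuming the same sample frame. Because np.unique returns the distinct values sorted, the integer codes are numbered by sorted ID rather than by first appearance, yet the resulting mask is the same. The names `inverse` and `totals` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# u is the sorted array of distinct IDs; inverse[i] is the position
# of the i-th ID within u.
u, inverse = np.unique(df['ID'].to_numpy(), return_inverse=True)

# Per-code sums of x, now ordered by sorted ID.
totals = np.bincount(inverse, weights=df['x'].to_numpy())

mask = totals[inverse] >= 2
result = df[mask]

print(u, inverse, totals)
print(result)
```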




One-liner

df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]




Timing (the results in each block below are listed in the same order as these statements)

%timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]
%timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
%timeit df.groupby('ID').filter(lambda s: s.x.sum()>=2)
%timeit df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2]
%timeit df.loc[df.groupby(['ID'])['x'].transform('sum')>=2]


small data

1000 loops, best of 3: 302 µs per loop
1000 loops, best of 3: 241 µs per loop
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.2 ms per loop
1000 loops, best of 3: 1.21 ms per loop


large data  

np.random.seed([3,1415])
df = pd.DataFrame(dict(
        ID=np.random.randint(100, size=10000),
        x=np.random.randint(2, size=10000)
    ))

1000 loops, best of 3: 528 µs per loop
1000 loops, best of 3: 847 µs per loop
10 loops, best of 3: 20.9 ms per loop
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 1.55 ms per loop


larger data  

np.random.seed([3,1415])
df = pd.DataFrame(dict(
        ID=np.random.randint(100, size=100000),
        x=np.random.randint(2, size=100000)
    ))

1000 loops, best of 3: 2.01 ms per loop
100 loops, best of 3: 6.44 ms per loop
10 loops, best of 3: 29.4 ms per loop
100 loops, best of 3: 3.84 ms per loop
100 loops, best of 3: 3.74 ms per loop
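
Before trusting a benchmark like this, it is worth confirming that every variant selects the same rows. The following is a minimal sketch using the "large data" setup above. The helper names (`via_factorize` and so on) and the extra threshold of 50 are hypothetical additions for illustration; absolute timings depend on the machine and library versions, so none are asserted here.

```python
import numpy as np
import pandas as pd

# Same setup as the "large data" case above.
np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    ID=np.random.randint(100, size=10000),
    x=np.random.randint(2, size=10000),
))

def via_factorize(d, threshold):
    f = d['ID'].factorize()[0]
    return d[np.bincount(f, d['x'].to_numpy())[f] >= threshold]

def via_unique(d, threshold):
    f = np.unique(d['ID'].to_numpy(), return_inverse=True)[1]
    return d[np.bincount(f, d['x'].to_numpy())[f] >= threshold]

def via_filter(d, threshold):
    return d.groupby('ID').filter(lambda s: s['x'].sum() >= threshold)

def via_transform(d, threshold):
    return d.loc[d.groupby('ID')['x'].transform('sum') >= threshold]

variants = (via_factorize, via_unique, via_filter)

# Check the threshold used in the timings (2) and a stricter,
# hypothetical one (50) that is more likely to split the groups.
agree = {
    t: all(fn(df, t).equals(via_transform(df, t)) for fn in variants)
    for t in (2, 50)
}
print(agree)
```

Each helper can then be passed to %timeit (or the standard timeit module) in place of the one-liners above.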

                                                                        
                                                        
            
            
              
                