How to remove a row from pandas dataframe based on the length of the column values?

Asked by 失恋的感觉, 2021-01-17 10:16 · 5 answers · 1929 views

In the following pandas.DataFrame:

df = 
    alfa    beta   ceta
    a,b,c   c,d,e  g,e,h
    a,b     d,e,f  g,h,k
    j,k     c,k,l  f,k,n

I want to remove the rows in which the alfa column contains three or more comma-separated values.
5 Answers
  • 2021-01-17 10:54

    You can apply that test to each row in turn using pandas.DataFrame.apply():

    print(df[df['alfa'].apply(lambda x: len(x.split(',')) < 3)])
    

    Gives:

      alfa   beta   ceta
    1  a,b  d,e,f  g,h,k
    2  j,k  c,k,l  f,k,n
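    As a self-contained sketch (constructing the example frame from the question):

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# Keep only rows whose alfa value has fewer than three comma-separated items
result = df[df["alfa"].apply(lambda x: len(x.split(",")) < 3)]
print(result)
```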
    
  • 2021-01-17 10:54

    How's this?

    df = df[df['alfa'].str.split(',', expand=True)[2].isnull()]
    

    Using expand=True creates a new dataframe with one column for each item in the list. If the list has three or more items, then the third column will have a non-null value.

    One problem with this approach is that if none of the lists have three or more items, selecting column [2] will cause a KeyError. Based on this, it's safer to use the solution posted by @Stephen Rauch.
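    A minimal sketch of such a guard (assuming the example frame from the question): check whether column 2 exists before indexing it.

```python
import pandas as pd

df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

parts = df["alfa"].str.split(",", expand=True)
# Column 2 only exists when at least one row has three or more items,
# so guard against the KeyError described above.
if 2 in parts.columns:
    result = df[parts[2].isnull()]
else:
    result = df  # nothing to drop: every row has fewer than three items
```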

  • 2021-01-17 10:56

    There are at least two ways to subset the given DataFrame:

    1) Split on the comma separator and then compute length of the resulting list:

    df[df['alfa'].str.split(",").str.len().lt(3)]
    

    2) Count the number of commas and add 1 to account for the final item:

    df[df['alfa'].str.count(",").add(1).lt(3)] 
    

    Both produce:

      alfa   beta   ceta
    1  a,b  d,e,f  g,h,k
    2  j,k  c,k,l  f,k,n
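    A quick sanity check (a sketch on the question's example frame) that the two expressions select the same rows:

```python
import pandas as pd

df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

mask_split = df["alfa"].str.split(",").str.len().lt(3)
mask_count = df["alfa"].str.count(",").add(1).lt(3)

# Both masks are identical, so both subsets are too
assert mask_split.equals(mask_count)
print(df[mask_split])
```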
  • 2021-01-17 11:01

    This is the numpy version of @NickilMaveli's answer.

    mask = np.core.defchararray.count(df.alfa.values.astype(str), ',') <= 1
    pd.DataFrame(df.values[mask], df.index[mask], df.columns)
    
      alfa   beta   ceta
    1  a,b  d,e,f  g,h,k
    2  j,k  c,k,l  f,k,n
    

    naive timing (benchmark results not preserved)
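    For completeness, the same mask as a runnable sketch; np.char.count is the public alias for np.core.defchararray.count:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# At most one comma means at most two comma-separated items
mask = np.char.count(df["alfa"].values.astype(str), ",") <= 1
result = pd.DataFrame(df.values[mask], df.index[mask], df.columns)
print(result)
```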

  • 2021-01-17 11:10

    Here is an option that is easy to remember and still embraces the DataFrame, the beating heart of pandas:

    1) Create a new column in the dataframe with the number of comma-separated items:

    df['length'] = df.alfa.str.split(',').str.len()
    

    2) Index using the new column:

    df = df[df.length < 3]
    

    Here is a comparison with the timings above. These are not very meaningful in this case, as the data is tiny, and timing usually matters less than how likely you are to remember an approach without interrupting your workflow:

    step 1:

    %timeit df['length'] = df.alfa.str.split(',').str.len()
    

    359 µs ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    step 2:

    df = df[df.length < 3]
    

    627 µs ± 76.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    The good news is that when the size grows, the time does not grow linearly. For example, the same operation on 30,000 rows takes about 3 ms (10,000× the data for roughly 3× the time). A pandas DataFrame is like a train: it takes energy to get going, so it is not great for tiny inputs in absolute terms, but with small data everything is fast anyway.
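    The two steps above, as a self-contained sketch on the question's example frame:

```python
import pandas as pd

df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# Step 1: number of comma-separated items in each alfa value
df["length"] = df.alfa.str.split(",").str.len()

# Step 2: keep only rows with fewer than three items
df = df[df.length < 3]
print(df)
```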
