I am trying to find the number of distinct values in each column using pandas. This is what I did:
import pandas as pd
import numpy as np
# Generate data.
NR
A pandas Series has a .value_counts() method that provides exactly what you want. Check out the documentation for it.
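For instance, a minimal sketch on a small made-up Series:

```python
import pandas as pd

# A small hypothetical Series.
s = pd.Series([0, 1, 1, 2, 3])

# value_counts() returns each distinct value with its frequency,
# sorted by frequency in descending order.
counts = s.value_counts()
print(counts)

# The number of distinct values is simply the length of that result.
print(len(counts))  # 4
```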
I found df.agg(['nunique']).T to be much faster.
As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:
df.nunique()
a 4
b 5
c 1
dtype: int64
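Two details worth knowing about nunique (a sketch; behaviour as described in the pandas docs for recent versions): NaN is excluded by default, and you can count across rows instead of columns.

```python
import pandas as pd
import numpy as np

# Hypothetical frame with a missing value.
df = pd.DataFrame({'a': [0, 1, 1, 2, np.nan],
                   'b': [1, 2, 3, 4, 5]})

# By default NaN is not counted as a distinct value.
print(df.nunique())              # a -> 3, b -> 5

# Pass dropna=False to count NaN as its own value.
print(df.nunique(dropna=False))  # a -> 4, b -> 5

# axis=1 counts distinct values per row instead of per column.
print(df.nunique(axis=1))
```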
Other legacy options:
You could transpose the df and then use apply to call nunique row-wise:
In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
Out[205]:
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)
Out[206]:
a 4
b 5
c 1
dtype: int64
EDIT
As pointed out by @ajcr the transpose is unnecessary:
In [208]:
df.apply(pd.Series.nunique)
Out[208]:
a 4
b 5
c 1
dtype: int64
df.apply(lambda x: len(x.unique()))
Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found another approach that runs faster than the apply function:
# Choose how you want to store the output; it could be a DataFrame or a dict.
# I will use a dict to demonstrate:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Import pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)
For me this runs almost twice as fast as df.apply(lambda x: len(x.unique())).
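The speed claim can be checked with timeit. The frame size and column names below are made up for illustration, and the relative timings depend on your data shape and pandas version, so treat this as a measurement sketch rather than a definitive benchmark:

```python
import timeit
import numpy as np
import pandas as pd

# A larger hypothetical frame so the timings are measurable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10_000, 5)),
                  columns=list('abcde'))

def via_apply():
    return df.apply(lambda x: len(x.unique()))

def via_dict():
    return {c: len(df[c].unique()) for c in df.columns}

def via_nunique():
    return df.nunique()

# Time each approach; print whatever your machine reports.
for fn in (via_apply, via_dict, via_nunique):
    t = timeit.timeit(fn, number=20)
    print(f'{fn.__name__}: {t:.3f}s')
```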
Adding example code for the answer given by @CaMaDuPe85:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
# df
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
# Use value_counts on each column and count the entries.
for cs in df.columns:
    print(cs, df[cs].value_counts().count())
# Output
a 4
b 5
c 1
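The same idea can be gathered into a dict with a comprehension. Note that value_counts().count() is equivalent to nunique() here, since both skip NaN by default:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Same per-column distinct counts as the loop above, as a dict.
counts = {c: df[c].value_counts().count() for c in df.columns}
print(counts)  # {'a': 4, 'b': 5, 'c': 1}
```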