unique combinations of values in selected columns in pandas data frame and count

后端未结

关注

 4  615

I have my data in pandas data frame as follows:

df1 = pd.DataFrame({\'A\':[\'yes\',\'yes\',\'yes\',\'yes\',\'no\',\'no\',\'yes\',\'yes\',\'yes\',\'no\'],


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  夕颜        
                
              
                            
                2020-11-27 10:59
              
            
            
                                                                       
Slightly related, I was looking for the unique combinations and I came up with this method:

def unique_columns(df,columns):

    result = pd.Series(index = df.index)

    groups = meta_data_csv.groupby(by = columns)
    for name,group in groups:
       is_unique = len(group) == 1
       result.loc[group.index] = is_unique

    assert not result.isnull().any()

    return result


And if you only want to assert that all combinations are unique:

df1.set_index(['A','B']).index.is_unique

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一向        
                
              
                            
                2020-11-27 11:07
              
            
            
                                                                       
You can groupby on cols 'A' and 'B' and call size and then reset_index and rename the generated column:

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3


update

A little explanation, by grouping on the 2 columns, this groups rows where A and B values are the same, we call size which returns the number of unique groups:

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64


So now to restore the grouped columns, we call reset_index:

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3


This restores the indices but the size aggregation is turned into a generated column 0, so we have to rename this:

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3


groupby does accept the arg as_index which we could have set to False so it doesn't make the grouped columns the index, but this generates a series and you'd still have to restore the indices and so on....:

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情歌与酒        
                
              
                            
                2020-11-27 11:07
              
            
            
                                                                       
I haven't done time test with this but it was fun to try.  Basically convert two columns to one column of tuples.  Now convert that to a dataframe, do 'value_counts()' which finds the unique elements and counts them.  Fiddle with zip again and put the columns in order you want.  You can probably make the steps more elegant but working with tuples seems more natural to me for this problem
b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

b['count'] = pd.Series(zip(*[b.A,b.B]))
df = pd.DataFrame(b['count'].value_counts().reset_index())
df['A'], df['B'] = zip(*df['index'])
df = df.drop(columns='index')[['A','B','count']]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  悲哀的现实        
                
              
                            
                2020-11-27 11:14
              
            
            
                                                                       
Placing @EdChum's very nice answer into a function count_unique_index. 
The unique method only works on pandas series, not on data frames. 
The function below reproduces the behavior of the unique function in R:


  unique returns a vector, data frame or array like x but with duplicate elements/rows removed.


And adds a count of the occurrences as requested by the OP.

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],                                                                                             
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})                                                                                               
def count_unique_index(df, by):                                                                                                                                                 
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})                                                                                                      

count_unique_index(df1, ['A','B'])                                                                                                                                              
     A    B  count                                                                                                                                                                  
0   no   no      1                                                                                                                                                                  
1   no  yes      2                                                                                                                                                                  
2  yes   no      4                                                                                                                                                                  
3  yes  yes      3

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复