unique combinations of values in selected columns in pandas data frame and count

后端 未结 4 615
无人及你
无人及你 2020-11-27 10:19

I have my data in pandas data frame as follows:

df1 = pd.DataFrame({\'A\':[\'yes\',\'yes\',\'yes\',\'yes\',\'no\',\'no\',\'yes\',\'yes\',\'yes\',\'no\'],
            


        
相关标签:
4条回答
  • 2020-11-27 10:59

    Slightly related, I was looking for the unique combinations and I came up with this method:

    def unique_columns(df,columns):
    
        result = pd.Series(index = df.index)
    
        groups = meta_data_csv.groupby(by = columns)
        for name,group in groups:
           is_unique = len(group) == 1
           result.loc[group.index] = is_unique
    
        assert not result.isnull().any()
    
        return result
    

    And if you only want to assert that all combinations are unique:

    df1.set_index(['A','B']).index.is_unique
    
    0 讨论(0)
  • 2020-11-27 11:07

    You can groupby on cols 'A' and 'B' and call size and then reset_index and rename the generated column:

    In [26]:
    
    df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
    Out[26]:
         A    B  count
    0   no   no      1
    1   no  yes      2
    2  yes   no      4
    3  yes  yes      3
    

    update

    A little explanation, by grouping on the 2 columns, this groups rows where A and B values are the same, we call size which returns the number of unique groups:

    In[202]:
    df1.groupby(['A','B']).size()
    
    Out[202]: 
    A    B  
    no   no     1
         yes    2
    yes  no     4
         yes    3
    dtype: int64
    

    So now to restore the grouped columns, we call reset_index:

    In[203]:
    df1.groupby(['A','B']).size().reset_index()
    
    Out[203]: 
         A    B  0
    0   no   no  1
    1   no  yes  2
    2  yes   no  4
    3  yes  yes  3
    

    This restores the indices but the size aggregation is turned into a generated column 0, so we have to rename this:

    In[204]:
    df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
    
    Out[204]: 
         A    B  count
    0   no   no      1
    1   no  yes      2
    2  yes   no      4
    3  yes  yes      3
    

    groupby does accept the arg as_index which we could have set to False so it doesn't make the grouped columns the index, but this generates a series and you'd still have to restore the indices and so on....:

    In[205]:
    df1.groupby(['A','B'], as_index=False).size()
    
    Out[205]: 
    A    B  
    no   no     1
         yes    2
    yes  no     4
         yes    3
    dtype: int64
    
    0 讨论(0)
  • 2020-11-27 11:07

    I haven't done time test with this but it was fun to try. Basically convert two columns to one column of tuples. Now convert that to a dataframe, do 'value_counts()' which finds the unique elements and counts them. Fiddle with zip again and put the columns in order you want. You can probably make the steps more elegant but working with tuples seems more natural to me for this problem

    b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})
    
    b['count'] = pd.Series(zip(*[b.A,b.B]))
    df = pd.DataFrame(b['count'].value_counts().reset_index())
    df['A'], df['B'] = zip(*df['index'])
    df = df.drop(columns='index')[['A','B','count']]
    
    0 讨论(0)
  • 2020-11-27 11:14

    Placing @EdChum's very nice answer into a function count_unique_index. The unique method only works on pandas series, not on data frames. The function below reproduces the behavior of the unique function in R:

    unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

    And adds a count of the occurrences as requested by the OP.

    df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],                                                                                             
                        'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})                                                                                               
    def count_unique_index(df, by):                                                                                                                                                 
        return df.groupby(by).size().reset_index().rename(columns={0:'count'})                                                                                                      
    
    count_unique_index(df1, ['A','B'])                                                                                                                                              
         A    B  count                                                                                                                                                                  
    0   no   no      1                                                                                                                                                                  
    1   no  yes      2                                                                                                                                                                  
    2  yes   no      4                                                                                                                                                                  
    3  yes  yes      3
    
    0 讨论(0)
提交回复
热议问题