Finding count of distinct elements in DataFrame in each column

半阙折子戏 2020-12-02 18:27

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

import pandas as pd
import numpy as np

# Generate data.
NR         


        
8 Answers
  • 2020-12-02 18:42

    A pandas Series has a .value_counts() method that returns how often each distinct value occurs, so the number of entries in its result is the number of distinct values in that column. Check out the documentation for the method.
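As a minimal sketch of the suggestion above, assuming the small example frame that other answers in this thread use (the names `counts_a` and `distinct_a` are only illustrative):

```python
import pandas as pd

# Example frame borrowed from other answers in this thread.
df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# value_counts() maps each distinct value to the number of times it occurs.
counts_a = df['a'].value_counts()
print(counts_a)

# There is one entry per distinct value, so the length is the distinct count.
distinct_a = len(counts_a)
print(distinct_a)
```

Note that value_counts() does more work than is strictly needed here, since it also computes the frequencies.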

  • 2020-12-02 18:43

    I found this to be much faster:

    df.agg(['nunique']).T
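To show what shape that expression returns, here is a small sketch using the example frame from elsewhere in this thread (the names `wide`, `tall` and `per_column` are only illustrative; no timing claim is made here):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# Passing a list to agg() returns a DataFrame with one row per function name.
wide = df.agg(['nunique'])

# Transposing puts the original columns on the index, with a single 'nunique' column.
tall = wide.T
print(tall)

# Selecting that column gives a plain Series of distinct counts per column.
per_column = tall['nunique']
```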

  • 2020-12-02 18:47

    As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:

    df.nunique()
    a    4
    b    5
    c    1
    dtype: int64
    

    Other legacy options:

    You could transpose the df and then use apply to call nunique row-wise:

    In [205]:
    df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
    df
    
    Out[205]:
       a  b  c
    0  0  1  1
    1  1  2  1
    2  1  3  1
    3  2  4  1
    4  3  5  1
    
    In [206]:
    df.T.apply(lambda x: x.nunique(), axis=1)
    
    Out[206]:
    a    4
    b    5
    c    1
    dtype: int64
    

    EDIT

    As pointed out by @ajcr the transpose is unnecessary:

    In [208]:
    df.apply(pd.Series.nunique)
    
    Out[208]:
    a    4
    b    5
    c    1
    dtype: int64
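A quick way to convince yourself that the three forms in this answer agree is to run them side by side. This is only a sketch on the same example frame; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# pandas 0.20+: nunique directly on the DataFrame.
direct = df.nunique()

# Legacy: transpose, then apply nunique row-wise.
transposed = df.T.apply(lambda x: x.nunique(), axis=1)

# Legacy without the transpose: apply nunique to each column.
column_wise = df.apply(pd.Series.nunique)

# Compare the results as plain dicts of column name -> distinct count.
print(direct.to_dict() == transposed.to_dict() == column_wise.to_dict())
```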
    
  • 2020-12-02 18:50
    df.apply(lambda x: len(x.unique()))
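One caveat worth knowing: unique() keeps NaN as a value, while nunique() drops it by default, so the two approaches can disagree on columns with missing data. A small sketch, assuming a hypothetical frame `df_nan` that is not part of the original question:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column 'x'.
df_nan = pd.DataFrame({'x': [1.0, 1.0, np.nan], 'y': [1, 2, 3]})

# unique() includes NaN, so it is counted as a distinct value here.
with_unique = df_nan.apply(lambda x: len(x.unique()))

# nunique() ignores NaN unless dropna=False is passed.
default_nunique = df_nan.nunique()
keep_nan = df_nan.nunique(dropna=False)

print(with_unique['x'], default_nunique['x'], keep_nan['x'])
```

If your columns contain no missing values, both approaches give the same counts.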
    
  • 2020-12-02 18:54

    I recently had the same problem of counting the unique values in each column of a DataFrame, and I found an approach that runs faster than the apply function:

    # Store the output however you like (a pd.DataFrame or a dict); a dict is used here.
    col_uni_val = {}
    for i in df.columns:
        col_uni_val[i] = len(df[i].unique())

    # pprint displays the dict nicely.
    import pprint
    pprint.pprint(col_uni_val)
    

    For me this runs almost twice as fast as df.apply(lambda x: len(x.unique())).
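Timings like this depend on the data and the pandas version, so it is worth measuring on your own frame. The sketch below uses timeit on hypothetical random data; the frame size, the helper names `with_loop` and `with_apply`, and the repeat count are all assumptions made for illustration:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data; the shape and value range are arbitrary.
rng = np.random.default_rng(0)
df_big = pd.DataFrame(rng.integers(0, 1000, size=(10_000, 20)))


def with_loop(frame):
    # Explicit loop over columns, as in the answer above.
    return {col: len(frame[col].unique()) for col in frame.columns}


def with_apply(frame):
    # The apply-based version it is being compared against.
    return frame.apply(lambda x: len(x.unique())).to_dict()


# Both approaches should produce the same counts before comparing speed.
assert with_loop(df_big) == with_apply(df_big)

for name, fn in [('loop', with_loop), ('apply', with_apply)]:
    seconds = timeit.timeit(lambda: fn(df_big), number=20)
    print(f"{name}: {seconds:.4f} s for 20 runs")
```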

  • 2020-12-02 18:54

    Adding example code for the answer given by @CaMaDuPe85:

    df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
    df
    
    # df
        a   b   c
    0   0   1   1
    1   1   2   1
    2   1   3   1
    3   2   4   1
    4   3   5   1
    
    
    for cs in df.columns:
        # value_counts() has one entry per distinct value, so count() gives the number of distinct values
        print(cs, df[cs].value_counts().count())
    
    # Output
    
    a 4
    b 5
    c 1
    