I am trying to find the number of distinct values in each column using pandas. This is what I did:
import pandas as pd
import numpy as np
# Generate data.
NR
A pandas Series has a .value_counts() method that provides exactly what you want. Check out the documentation for it.
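For instance, a minimal sketch on a small made-up Series:

```python
import pandas as pd

# A small hypothetical Series.
s = pd.Series([0, 1, 1, 2, 3])

# value_counts() returns each distinct value with its frequency,
# sorted by frequency in descending order.
counts = s.value_counts()
print(counts)

# The number of distinct values is simply the length of that result.
print(len(counts))  # 4
```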
I found df.agg(['nunique']).T to be much faster.
As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:
df.nunique()
a 4
b 5
c 1
dtype: int64
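Two details worth knowing about nunique (a sketch; behaviour as described in the pandas docs for recent versions): NaN is excluded by default, and you can count across rows instead of columns.

```python
import pandas as pd
import numpy as np

# Hypothetical frame with a missing value.
df = pd.DataFrame({'a': [0, 1, 1, 2, np.nan],
                   'b': [1, 2, 3, 4, 5]})

# By default NaN is not counted as a distinct value.
print(df.nunique())              # a -> 3, b -> 5

# Pass dropna=False to count NaN as its own value.
print(df.nunique(dropna=False))  # a -> 4, b -> 5

# axis=1 counts distinct values per row instead of per column.
print(df.nunique(axis=1))
```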
Other legacy options:
You could transpose the df and then use apply to call nunique row-wise:
In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
Out[205]:
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)
Out[206]:
a 4
b 5
c 1
dtype: int64
EDIT
As pointed out by @ajcr the transpose is unnecessary:
In [208]:
df.apply(pd.Series.nunique)
Out[208]:
a 4
b 5
c 1
dtype: int64
df.apply(lambda x: len(x.unique()))
Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found another approach that runs faster than the apply function:
# Choose how you want to store the output; it could be a DataFrame or a dict.
# I will use a dict to demonstrate:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Import pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)
For me this runs almost twice as fast as df.apply(lambda x: len(x.unique())).
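The speed claim can be checked with timeit. The frame size and column names below are made up for illustration, and the relative timings depend on your data shape and pandas version, so treat this as a measurement sketch rather than a definitive benchmark:

```python
import timeit
import numpy as np
import pandas as pd

# A larger hypothetical frame so the timings are measurable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(10_000, 5)),
                  columns=list('abcde'))

def via_apply():
    return df.apply(lambda x: len(x.unique()))

def via_dict():
    return {c: len(df[c].unique()) for c in df.columns}

def via_nunique():
    return df.nunique()

# Time each approach; print whatever your machine reports.
for fn in (via_apply, via_dict, via_nunique):
    t = timeit.timeit(fn, number=20)
    print(f'{fn.__name__}: {t:.3f}s')
```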
Adding example code for the answer given by @CaMaDuPe85:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df
# df
a b c
0 0 1 1
1 1 2 1
2 1 3 1
3 2 4 1
4 3 5 1
# Use value_counts on each column and count the entries.
for cs in df.columns:
    print(cs, df[cs].value_counts().count())
# Output
a 4
b 5
c 1
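The same idea can be gathered into a dict with a comprehension. Note that value_counts().count() is equivalent to nunique() here, since both skip NaN by default:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

# Same per-column distinct counts as the loop above, as a dict.
counts = {c: df[c].value_counts().count() for c in df.columns}
print(counts)  # {'a': 4, 'b': 5, 'c': 1}
```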