Suppose I have the following data frame:
0 1 2
new NaN NaN
new one one
a b c
NaN NaN NaN
How would I
It is not as fast as coldspeed's answer with set(), but you could also do:
df['_num_unique_values'] = df.T.nunique()
First, the transpose of the df dataframe is taken with df.T, so that each original row becomes a column. Then nunique() counts the unique values in each of those columns, excluding NaNs.
The result is added as a new column to the original dataframe.
df would now be:
0 1 2 _num_unique_values
0 new nan nan 1
1 new one one 2
2 a b c 3
3 nan nan nan 0
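As a rough check of the explanation above, the sketch below rebuilds the question's frame (the construction is my own reading of the table, not the asker's code) and compares the transpose approach with row-wise counting. The names via_transpose and via_axis are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (assumed construction).
df = pd.DataFrame({
    0: ["new", "new", "a", np.nan],
    1: [np.nan, "one", "b", np.nan],
    2: [np.nan, "one", "c", np.nan],
})

# After transposing, each original row is a column, so the default
# column-wise nunique() counts distinct non-NaN values per original row.
via_transpose = df.T.nunique()

# The same count without the transpose, by asking for row-wise counting.
via_axis = df.nunique(axis=1)

print(via_transpose.tolist())
print(via_axis.tolist())
```

Both calls should give the same per-row counts, so the transpose is mainly a matter of taste.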
A more abstract solution:
df['num_uniq']=df.nunique(axis=1)
Use a list comprehension with set:
df['num_uniq'] = [len(set(v[pd.notna(v)].tolist())) for v in df.values]
df
0 1 2 num_uniq
0 new NaN NaN 1
1 new one one 2
2 a b c 3
3 NaN NaN NaN 0
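To spell out what the one-liner does, here is the same idea split into a small helper. The name count_unique_non_null is hypothetical, and the frame is rebuilt from the question's table as an assumption.

```python
import numpy as np
import pandas as pd

def count_unique_non_null(row):
    """Count distinct non-missing entries in a 1-D NumPy array (hypothetical helper)."""
    mask = pd.notna(row)                 # True where the entry is not NaN/None
    return len(set(row[mask].tolist()))  # a set keeps one copy of each value

# Assumed reconstruction of the question's frame.
df = pd.DataFrame({
    0: ["new", "new", "a", np.nan],
    1: [np.nan, "one", "b", np.nan],
    2: [np.nan, "one", "c", np.nan],
})

# df.values yields one NumPy array per row when iterated.
counts = [count_unique_non_null(row) for row in df.values]
print(counts)
```

The mask is applied before building the set so that missing values never reach it.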
You could do this with stack, groupby and nunique.
# df.join(df.stack().groupby(level=0).nunique().to_frame('num_uniq'))
df['num_uniq'] = df.stack().groupby(level=0).nunique()
df
0 1 2 num_uniq
0 new NaN NaN 1.0
1 new one one 2.0
2 a b c 3.0
3 NaN NaN NaN NaN
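Note that in the output above the all-NaN row ends up as NaN and the column becomes float. This is presumably because stack() dropped the NaN entries in the pandas version used, so that row never reaches the groupby. One possible fix, sketched here under that assumption, is to reindex the result against the original index and fill the gaps with 0:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frame.
df = pd.DataFrame({
    0: ["new", "new", "a", np.nan],
    1: [np.nan, "one", "b", np.nan],
    2: [np.nan, "one", "c", np.nan],
})

counts = (
    df.stack()                          # long format, one entry per cell
      .groupby(level=0)                 # group by the original row label
      .nunique()                        # distinct non-NaN values per row
      .reindex(df.index, fill_value=0)  # rows missing from the result get 0
)
df["num_uniq"] = counts
print(df["num_uniq"].tolist())
```

With the reindex in place the column stays integer-valued and the all-NaN row gets a count of 0.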
Yet another option is apply and nunique:
df['num_uniq'] = df.apply(pd.Series.nunique, axis=1)
df
0 1 2 num_uniq
0 new NaN NaN 1
1 new one one 2
2 a b c 3
3 NaN NaN NaN 0
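Since apply forwards extra keyword arguments to the function it calls, the same pattern can also count NaN as a value of its own, should that ever be wanted. This is a side note rather than part of the original answer; the variable names are illustrative and the frame is rebuilt from the question as an assumption.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frame.
df = pd.DataFrame({
    0: ["new", "new", "a", np.nan],
    1: [np.nan, "one", "b", np.nan],
    2: [np.nan, "one", "c", np.nan],
})

# Default behaviour: NaN is ignored when counting.
without_nan = df.apply(pd.Series.nunique, axis=1)

# dropna=False is passed through to Series.nunique, so NaN counts once per row.
with_nan = df.apply(pd.Series.nunique, axis=1, dropna=False)

print(without_nan.tolist())
print(with_nan.tolist())
```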
Performance
df_ = df
df = pd.concat([df_] * 1000, ignore_index=True)
%timeit df['num_uniq'] = [len(set(v[pd.notna(v)])) for v in df.values]
%timeit df['num_uniq'] = df.stack().groupby(level=0).nunique()
%timeit df['num_uniq'] = df.apply(pd.Series.nunique, axis=1)
%timeit df['num_uniq'] = df.nunique(1)
196 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.34 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
679 ms ± 24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.21 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
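%timeit is an IPython magic, so the comparison above only runs in IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, might look like the sketch below. The dictionary layout and the number=3 setting are my own choices, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frame, repeated 1000 times.
base = pd.DataFrame({
    0: ["new", "new", "a", np.nan],
    1: [np.nan, "one", "b", np.nan],
    2: [np.nan, "one", "c", np.nan],
})
big = pd.concat([base] * 1000, ignore_index=True)

# Each candidate is wrapped in a zero-argument callable for timeit.
candidates = {
    "set comprehension": lambda: [len(set(v[pd.notna(v)])) for v in big.values],
    "stack + groupby": lambda: big.stack().groupby(level=0).nunique(),
    "apply": lambda: big.apply(pd.Series.nunique, axis=1),
    "nunique(axis=1)": lambda: big.nunique(axis=1),
}

timings = {}
for name, func in candidates.items():
    # Mean over 3 calls; few repetitions keep the script short.
    timings[name] = timeit.timeit(func, number=3) / 3
    print(f"{name}: {timings[name] * 1000:.1f} ms per call")
```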
Just use nunique(axis=1).
import numpy as np
import pandas as pd
# Same frame as in the question: row 1 is ('new', 'one', 'one').
data = {0: ['new', 'new', 'a', np.nan],
        1: [np.nan, 'one', 'b', np.nan],
        2: [np.nan, 'one', 'c', np.nan]}
df = pd.DataFrame(data)
print(df.nunique(axis=1))
df['num_unique'] = df.nunique(axis=1)