I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to
If you don't want to include NaN values, using Counter is much, much faster than pd.Series.mode or pd.Series.value_counts()[0]:
from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    # most_common(1) returns [(value, count)]; keep only the value
    return my_counter.most_common(1)[0][0]
df.groupby(col).agg(get_most_common)
should work. This will fail when you have NaN values, as each NaN will be counted separately.
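One way to sidestep the NaN issue is to drop missing values before counting. The following is only a sketch: the helper name most_common_ignoring_nan and the demo frame are made up for illustration, and returning NaN for an all-missing group is an assumption about the behaviour you want.

```python
from collections import Counter

import numpy as np
import pandas as pd


def most_common_ignoring_nan(srs):
    # Drop missing values first so they are never counted at all.
    counts = Counter(srs.dropna())
    # An all-NaN group leaves the counter empty; fall back to NaN (assumed behaviour).
    return counts.most_common(1)[0][0] if counts else np.nan


# Hypothetical demo data: group "b" contains only a missing value.
demo = pd.DataFrame({
    "key": ["a", "a", "a", "b"],
    "val": ["x", np.nan, "x", np.nan],
})
result = demo.groupby("key")["val"].agg(most_common_ignoring_nan)
```

The dropna call costs one extra pass per group, so this may be somewhat slower than the plain Counter version.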
The problem here is performance: with a lot of rows, the per-group lambda will be slow.
If that is your case, please try this:
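import pandas as pd
source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short_name' : ['NY','New','Spb','NY']})
# Per-group lambda: simple, but slow when there are many groups
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Count once across all groups. Note that .first() keeps the top count per group;
# .head(1) would keep the matching Short_name label in the index instead
source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()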
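The grouped value_counts result is sorted by count within each group, so keeping the first row of each group gives the most frequent label. A minimal sketch of that idea on the same sample data, where the variable names counts, top and modes are illustrative only:

```python
import pandas as pd

source = pd.DataFrame({
    "Country": ["USA", "USA", "Russia", "USA"],
    "City": ["New-York", "New-York", "Sankt-Petersburg", "New-York"],
    "Short_name": ["NY", "New", "Spb", "NY"],
})

# One vectorised count over all groups; the index is (Country, City, Short_name).
counts = source.groupby(["Country", "City"])["Short_name"].value_counts()

# Counts are sorted descending inside each group, so the first row is the mode.
top = counts.groupby(level=["Country", "City"]).head(1)

# The labels live in the index; turn them back into ordinary columns.
modes = top.index.to_frame(index=False)
```

Ties are resolved by whichever label value_counts happens to list first, so this picks one mode per group rather than all of them.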
If you want another approach that does not depend on value_counts or scipy.stats, you can use the Counter collection:
from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]
Which can be applied to the above example like this
src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
'Short_name' : ['NY','New','Spb','NY']})
src.groupby(['Country','City']).agg(get_most_common)
For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.
stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.
With these two simple changes:
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])
returns
                         Short name
Country City
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
pd.Series.mode is available! Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
If this is needed as a DataFrame, use
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()
                         Short name
Country City
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.
# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Series.mode also does a good job when there are multiple modes:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)
# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2
  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
Country  City
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object
Or, if you want a separate row for each mode, you can use GroupBy.apply:
source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)
Country  City
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object
If you don't care which mode is returned as long as it's one of them, then you will need a lambda that calls mode and extracts the first result.
source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])
Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
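The behaviours described above can be checked on plain Series, outside any groupby. This is a small sketch with made-up values; the variable names single, tied and empty are illustrative only.

```python
import pandas as pd

# One clear winner: mode() still returns a Series, here of length 1.
single = pd.Series(["NY", "New", "NY"]).mode()

# A tie: every mode is returned, in sorted order.
tied = pd.Series(["NY", "New", "NY", "New"]).mode()

# Only missing values: the result is an empty Series, so [0] would raise.
empty = pd.Series([None, None], dtype=object).mode()

# Taking the first element is only safe after checking for emptiness.
first_or_none = tied.iloc[0] if not tied.empty else None
```

Because the result can hold zero, one, or several values, any lambda that indexes it with [0] implicitly assumes each group has at least one non-missing value.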
You can also use statistics.mode from Python's standard library, but...
source.groupby(['Country','City'])['Short name'].apply(statistics.mode)
Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
...it does not work well when having to deal with multiple modes: on Python versions before 3.8, a StatisticsError is raised (since 3.8, the first mode encountered is returned instead). This is mentioned in the older docs:
If data is empty, or if there is not exactly one most common value, StatisticsError is raised.
But you can see for yourself...
statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
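If you are on Python 3.8 or newer (an assumption about your environment), the standard library also offers statistics.multimode, which returns every most-common value instead of raising. A short sketch with made-up values:

```python
import statistics

values = ["NY", "New", "NY", "New", "Spb"]

# multimode returns all most-common values, in the order first encountered.
all_modes = statistics.multimode(values)

# Picking the first entry gives "any one mode"; an empty input yields [].
first_mode = all_modes[0] if all_modes else None
no_modes = statistics.multimode([])
```

Since it returns a plain list, it can be passed to GroupBy.apply in the same way as statistics.mode, with each group yielding a list of modes.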
The two top answers here suggest:
df.groupby(cols).agg(lambda x:x.value_counts().index[0])
or, preferably
df.groupby(cols).agg(pd.Series.mode)
However, both of these fail in simple edge cases, as demonstrated here:
df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]  # np.NaN was removed in NumPy 2.0
})
The first:
df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])
yields IndexError (because of the empty Series returned by group C). The second:
df.groupby(['client_id', 'date']).agg(pd.Series.mode)
returns ValueError: Function does not reduce, since the first group returns a list of two (there are two modes). (As documented here, if the first group returned a single mode this would work!)
Two possible solutions for this case are:
import scipy.stats  # a bare "import scipy" is not guaranteed to load the stats submodule
df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])
And the solution given to me by cs95 in the comments here:
def foo(x):
    m = pd.Series.mode(x)
    # mode() drops NaN, so an all-NaN group gives an empty result; fall back to NaN
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)
However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):
def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.nan))  # restore NaNs
group_cols = ['client_id', 'date']
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values
Essentially, the method works on one col at a time and outputs a df, so instead of concat, which is intensive, you treat the first as a df, and then iteratively add the output array (values.flatten()) as a column in the df.
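The count, sort and de-duplicate idea can be traced on the edge-case frame from this answer. The sketch below is a variation rather than the function above: it assumes pandas 1.1 or newer and uses groupby(..., dropna=False) in place of the -1 placeholder, and the names counts and modes are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "client_id": ["A", "A", "A", "A", "B", "B", "B", "C"],
    "date": ["2019-01-01"] * 8,
    "location": ["NY", "NY", "LA", "LA", "DC", "DC", "LA", np.nan],
})
group_cols = ["client_id", "date"]

# Step 1: count every (group, value) pair; dropna=False keeps the NaN-only group C.
counts = (
    df.groupby(group_cols + ["location"], dropna=False)
      .size()
      .reset_index(name="count")
)

# Step 2: put the highest counts first, then keep one row per group.
modes = (
    counts.sort_values("count", ascending=False)
          .drop_duplicates(subset=group_cols)
          .drop(columns="count")
          .sort_values(group_cols)
          .set_index(group_cols)
)
```

Group A has a tie between NY and LA, so which one survives depends on the sort order; group C keeps its NaN rather than being dropped.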