I have a data frame and I would like to know how many times the most frequent value appears in a given column. I try to do it in the following way:
items_counts = df['item'].value_counts()
NaN values are omitted when calculating frequencies. You can use the code below for the same functionality.
**>> Code:**
# Importing required modules
import pandas as pd
from collections import Counter

# Creating a dataframe
df = pd.DataFrame({'A': ["jan", "jan", "jan", "mar", "mar", "feb", "jan", "dec",
                         "mar", "jan", "dec"]})
# Creating a counter object
count = Counter(df['A'])
# Calling a method of Counter object(count)
count.most_common(3)
**>> Output:**
[('jan', 5), ('mar', 3), ('dec', 2)]
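If you only need the single most frequent value and its count, `most_common(1)` returns a one-element list of `(value, count)` pairs; a minimal sketch using the same example column:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({"A": ["jan", "jan", "jan", "mar", "mar", "feb",
                         "jan", "dec", "mar", "jan", "dec"]})

# most_common(1) returns a list with the single (value, count) pair
value, count = Counter(df["A"]).most_common(1)[0]
print(value, count)  # jan 5
```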
Add this line of code to find how many times the most frequent value appears:
df["item"].value_counts().nlargest(n=1).values[0]
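A minimal runnable sketch of that one-liner, using a hypothetical `item` column:

```python
import pandas as pd

df = pd.DataFrame({"item": ["apple", "banana", "apple", "cherry", "apple"]})

# value_counts returns counts sorted descending; nlargest(1) keeps the top count
top_count = df["item"].value_counts().nlargest(n=1).values[0]
print(top_count)  # 3 -- "apple" occurs three times
```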
To add to @jonathanrocher's answer, you could use `mode` on a pandas DataFrame. It gives the most frequent value(s) across the rows or columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [1,2,2,4,2], "b": [np.nan, np.nan, np.nan, 3, 3]})
In [2]: df.mode()
Out[2]:
   a    b
0  2  3.0
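`mode` returns only the value, not its frequency; if you also need the count, you can combine it with `value_counts`. A small sketch using the same frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 2, 4, 2], "b": [np.nan, np.nan, np.nan, 3, 3]})

# df["a"].mode() keeps the most frequent value in the column;
# value_counts then tells you how often that value occurs
top_a = df["a"].mode()[0]
count_a = df["a"].value_counts()[top_a]
print(top_a, count_a)  # 2 3
```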
Just take the first row of your `items_counts` series:

top = items_counts.head(1)  # or items_counts.iloc[[0]]
value, count = top.index[0], top.iat[0]

This works because `pd.Series.value_counts` has `sort=True` by default and so is already ordered by counts, highest count first. Extracting a value from an index by location has O(1) complexity, while `pd.Series.idxmax` has O(n) complexity, where n is the number of categories.

Specifying `sort=False` is still possible, and then `idxmax` is recommended:
items_counts = df['item'].value_counts(sort=False)
top = items_counts.loc[[items_counts.idxmax()]]
value, count = top.index[0], top.iat[0]
Notice that in this case you don't need to call `max` and `idxmax` separately; just extract the index via `idxmax` and feed it to the `loc` label-based indexer.
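Both routes should give the same answer; a small check, assuming the same `item` column name as in the question:

```python
import pandas as pd

df = pd.DataFrame({"item": ["a", "b", "a", "c", "a", "b"]})

# Sorted route: counts come back ordered descending, so the first row is the answer
sorted_counts = df["item"].value_counts()
top_sorted = sorted_counts.head(1)

# Unsorted route: locate the maximum explicitly via idxmax
unsorted_counts = df["item"].value_counts(sort=False)
top_unsorted = unsorted_counts.loc[[unsorted_counts.idxmax()]]

print(top_sorted.index[0], top_sorted.iat[0])      # a 3
print(top_unsorted.index[0], top_unsorted.iat[0])  # a 3
```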
You may also consider using scipy's `mode` function, which ignores NaN. A solution using it could look like:

from scipy.stats import mode
import pandas as pd
from numpy import nan

df = pd.DataFrame({"a": [1, 2, 2, 4, 2], "b": [nan, nan, nan, 3, 3]})
print(mode(df))
The output would look like
(array([[ 2., 3.]]), array([[ 3., 2.]]))
meaning that the most common values are 2 for the first column and 3 for the second, with frequencies 3 and 2 respectively.
It looks like you may have some nulls in the column. You can drop them with `df = df.dropna(subset=['item'])`. Then `df['item'].value_counts().max()` should give you the max counts, and `df['item'].value_counts().idxmax()` should give you the most frequent value.
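A minimal sketch of that suggestion, assuming a column named `item` containing some nulls:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"item": ["x", "y", np.nan, "x", np.nan, "x"]})

# value_counts drops NaN by default, but dropna makes the intent explicit
df = df.dropna(subset=["item"])

print(df["item"].value_counts().max())     # 3 -- count of the top value
print(df["item"].value_counts().idxmax())  # x -- the top value itself
```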