I have a data set with a column named Native Country that contains around 30000
records. Some are missing, represented by NaN,
so I thoug
mode() returns a Series, so just take its first element:
data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)
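or you can do the same with assignment (the safer form, since inplace=True on a selected column may not update the DataFrame in recent pandas versions):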
data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
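Be careful: if NaN is counted as a value (mode(dropna=False)), NaN itself may be the mode, and you would be replacing NaN with another NaN. In recent pandas versions mode() ignores NaN by default, but on a column where every value is missing it returns an empty Series, so mode()[0] raises an error.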
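A minimal sketch of guarding against that case, assuming a pandas version where Series.mode() accepts a dropna argument. The helper name fill_with_mode and the sample data are made up for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd

def fill_with_mode(series):
    """Return a copy of `series` with NaN replaced by its first mode, if one exists.

    Hypothetical helper for illustration only.
    """
    modes = series.mode()   # dropna=True by default, so NaN is not counted
    if modes.empty:         # every value is missing: there is nothing to fill with
        return series.copy()
    return series.fillna(modes.iloc[0])

# Sample column in which NaN is the most frequent entry
s = pd.Series([np.nan, 'France', np.nan, 'Spain', 'France', np.nan])

print(s.mode(dropna=False))   # counting NaN as a value makes NaN the mode
print(fill_with_mode(s))      # the default mode() ignores NaN when picking the fill value
```

Returning the column unchanged when no mode exists is only one possible choice; raising an error or falling back to a constant would work just as well.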
Try something like:
fill_mode = lambda col: col.fillna(col.mode().iloc[0])  # mode() returns a Series; use its first value
and apply it to every column:
new_df = df.apply(fill_mode, axis=0)
You can compute the mode (or the value from any other strategy) first:
num = data['Native Country'].mode()[0]  # mode() returns a Series; take its first value
data['Native Country'].fillna(num, inplace=True)
or in one line like this:
data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)
import numpy as np
import pandas as pd
print(pd.__version__)
1.2.0
df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})
| | Country | Purchased |
|---|---|---|
| 0 | NaN | NaN |
| 1 | France | Yes |
| 2 | NaN | Yes |
| 3 | Spain | No |
| 4 | France | NaN |
df.fillna(df.mode())  # only row 0 is filled: df.mode() returns a one-row DataFrame and fillna aligns on the index
| | Country | Purchased |
|---|---|---|
| 0 | France | Yes |
| 1 | France | Yes |
| 2 | NaN | Yes |
| 3 | Spain | No |
| 4 | France | NaN |
df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})
df.fillna(df.mode().iloc[0])  # .iloc[0] takes the first row as a Series, so each column is filled with its own mode
| | Country | Purchased |
|---|---|---|
| 0 | France | Yes |
| 1 | France | Yes |
| 2 | France | Yes |
| 3 | Spain | No |
| 4 | France | Yes |
If we fill in the missing values with fillna(df['colX'].mode()), the result of mode() is a Series, so fillna aligns on the index and only fills missing values in the first few rows whose index labels match. At least, that is what happens if it is done as below:
fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
However, by simply taking the first value of the Series, fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes the already biased imputation method worse. For example, we would take only 0 if [0, 21, 99] are the equally most frequent values, or fill missing values with False when True and False are equally frequent in a given column.
I don't have a clear-cut solution here. Assigning a random value drawn from all the modes (the equally most frequent values) could be one approach if using the mode is a necessity.
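A rough sketch of that random-assignment idea, under the assumption that drawing uniformly among the tied modes is acceptable for the data at hand. The helper name fill_with_random_mode, its seed parameter, and the sample values are hypothetical and only for illustration:

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(series, seed=None):
    """Fill each NaN with a value drawn uniformly at random from the column's modes.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    modes = series.mode()        # all equally most frequent values, NaN ignored
    missing = series.isna()
    if modes.empty or not missing.any():
        return series.copy()     # nothing to fill, or nothing to fill with
    filled = series.copy()
    # Draw one mode per missing entry, independently
    filled[missing] = rng.choice(modes.to_numpy(), size=int(missing.sum()))
    return filled

# 0, 21 and 99 each appear twice, so all three are modes
s = pd.Series([0, 21, 99, 0, 21, 99, np.nan, np.nan, np.nan])
print(fill_with_random_mode(s, seed=0))
```

Passing a seed keeps the imputation reproducible. Because each missing entry gets its own draw, the filled values spread across the tied modes instead of piling onto the smallest one.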