Pandas Fillna Mode

后端未结


I have a data set with a column named Native Country that contains around 30,000 records. Some values are missing, represented by NaN, so I thoug


6 Answers

无人共我

2021-02-05 12:23

Take the first element of the Series that mode() returns:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

Or do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
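A minimal, self-contained sketch of the assignment form, in case it helps to see it run end to end. The sample values are made up for illustration and are not the asker's data. The assignment form is probably the safer of the two on recent pandas releases, where calling inplace=True on a selected column can trigger a chained-assignment warning.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's column.
data = pd.DataFrame(
    {"Native Country": ["United-States", np.nan, "Mexico", "United-States", np.nan]}
)

# mode() returns a Series (there can be ties); [0] takes its first entry.
most_common = data["Native Country"].mode()[0]

# Assign the result back instead of mutating a selected column in place.
data["Native Country"] = data["Native Country"].fillna(most_common)

print(data["Native Country"].tolist())
```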


忘掉有多难

2021-02-05 12:23

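Be careful: NaN can turn out to be the mode, in which case you are replacing NaN with another NaN. By default mode() ignores NaN (dropna=True), so this happens only if you call mode(dropna=False). With the default, the case to watch for is a column that is entirely NaN: mode() then returns an empty Series and mode()[0] raises an error.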
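A small sketch of both situations: how the dropna argument of Series.mode decides whether NaN can be returned, and one way to guard against a column with no non-missing values. The helper name fill_with_mode and the sample Series are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["France", "France", np.nan, np.nan, np.nan])

print(s.mode())              # default dropna=True: NaN is not counted
print(s.mode(dropna=False))  # NaN is counted and is the most frequent value here


def fill_with_mode(col):
    """Hypothetical helper: fill NaN with the first mode, if one exists."""
    modes = col.mode()  # dropna=True by default
    if modes.empty:     # the column is entirely NaN, nothing to fill with
        return col
    return col.fillna(modes.iloc[0])


print(fill_with_mode(s).tolist())
print(fill_with_mode(pd.Series([np.nan, np.nan])).tolist())
```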

佛祖请我去吃肉

2021-02-05 12:26

Try something like fill_mode = lambda col: col.fillna(col.mode()[0]), and then apply it to every column: new_df = df.apply(fill_mode, axis=0)
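As a sketch of the per-column approach: the sample frame is borrowed from another answer in this thread, and the comparison shows why the fill value should be a scalar (col.mode()[0]) rather than the Series that col.mode() returns. The names partial and full are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Country": [np.nan, "France", np.nan, "Spain", "France"],
        "Purchased": [np.nan, "Yes", "Yes", "No", np.nan],
    }
)

# col.mode() is a Series indexed 0, 1, ...; fillna aligns it on the index,
# so only rows whose labels match an entry of that Series get filled.
partial = df.apply(lambda col: col.fillna(col.mode()), axis=0)

# Taking [0] yields a scalar, which fills every missing row in the column.
full = df.apply(lambda col: col.fillna(col.mode()[0]), axis=0)

print(partial.isna().sum().sum())  # missing values left by the Series version
print(full.isna().sum().sum())     # missing values left by the scalar version
```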


情话喂你

2021-02-05 12:32

You can compute the mode (or the value from any other strategy) first and then fill with it:

num = data['Native Country'].mode()[0]
data['Native Country'].fillna(num, inplace=True)

Or in one line:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)


天命终不由人

2021-02-05 12:37

import numpy as np

import pandas as pd

print(pd.__version__)

1.2.0

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

	Country	Purchased
0	NaN	NaN
1	France	Yes
2	NaN	Yes
3	Spain	No
4	France	NaN

df.fillna(df.mode())  # fills only row 0: df.mode() returns a one-row DataFrame here, and fillna aligns it on the index

	Country	Purchased
0	France	Yes
1	France	Yes
2	NaN	Yes
3	Spain	No
4	France	NaN

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

df.fillna(df.mode().iloc[0])  # .iloc[0] gives a Series keyed by column name, so each column is filled with its own mode

	Country	Purchased
0	France	Yes
1	France	Yes
2	France	Yes
3	Spain	No
4	France	Yes


萌比男神i

2021-02-05 12:44
If we fill missing values with fillna(df['colX'].mode()), the result of mode() is a Series, so fillna aligns it on the index and fills only the first few rows whose index labels match. At least, that is what happens when it is done as below:
```
fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)
```
However, by simply taking the first value of the Series fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes the already biased imputation method worse. For example, taking only 0 if we have [0, 21, 99] as the equally most frequent values. Or filling missing values with False when True and False values are equally frequent in a given column.

I don't have a clear-cut solution here. If using the mode is a necessity, one approach could be to fill each missing value with a value drawn at random from all the tied modes.
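The random-assignment idea could be sketched roughly as follows. This is only one possible reading of the suggestion, not an established recipe: the helper name fill_with_random_mode, the seeded generator, and the sample values are all assumptions made for illustration. It spreads the imputed values over the tied modes instead of always taking the first one, but it does not remove the bias inherent in mode imputation.

```python
import numpy as np
import pandas as pd


def fill_with_random_mode(col, rng):
    """Hypothetical helper: fill each NaN with one of the modes, drawn at random."""
    modes = col.mode()  # all equally most frequent values, NaN excluded
    if modes.empty:     # column is entirely NaN
        return col
    missing = col.isna()
    out = col.copy()
    # Draw one mode per missing entry, uniformly over the tied modes.
    out[missing] = rng.choice(modes.to_numpy(), size=int(missing.sum()))
    return out


rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
s = pd.Series([0, 0, 21, 21, 99, 99, np.nan, np.nan, np.nan])
filled = fill_with_random_mode(s, rng)
print(filled.tolist())
```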