Pandas - Fill NaN using multiple values

Question

I have a column (let's call it Column X) containing around 16,000 NaN values. The column has two possible values, 1 or 0 (so it is binary).

I want to fill the NaN values in Column X, but I don't want to use a single value for all the NaN entries.

For instance, I want to fill 50% of the NaN values with 1 and the other 50% with 0.

I have read the fillna() documentation, but I have not found anything there that provides this functionality.

I have no idea how to move forward with this problem, so I haven't tried anything. I know I could write

df['Column_x'] = df['Column_x'].fillna(df['Column_x'].mode()[0])

but this would fill all the NaN values in Column_x of my DataFrame df with the mode of the column. I want to fill 50% with one value and the other 50% with a different value.

Since I haven't tried anything yet, I can't show or describe any actual results.

What I can say is that the expected result would be something like 8,000 of the NaN values in Column_x replaced with 1 and the other 8,000 replaced with 0.

A visual result would be something like:

Before Handling NaN

Index     Column_x
0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
5          0.0
6          1.0
7          1.0
8          1.0
9          1.0
10         1.0
11         1.0
12         NaN
13         NaN
14         NaN
15         NaN
16         NaN
17         NaN
18         NaN
19         NaN

After Handling NaN

Index     Column_x
0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
5          0.0
6          1.0
7          1.0
8          1.0
9          1.0
10         1.0
11         1.0
12         0.0
13         0.0
14         0.0
15         0.0
16         1.0
17         1.0
18         1.0
19         1.0

Answer 1:

Using pandas.Series.sample:

mask = df['Column_x'].isna() 
ind = df['Column_x'].loc[mask].sample(frac=0.5).index
df.loc[ind, 'Column_x'] = 1
df['Column_x'] = df['Column_x'].fillna(0)
print(df)

Output:

    Index  Column_x
0       0       0.0
1       1       0.0
2       2       0.0
3       3       0.0
4       4       0.0
5       5       0.0
6       6       1.0
7       7       1.0
8       8       1.0
9       9       1.0
10     10       1.0
11     11       1.0
12     12       1.0
13     13       0.0
14     14       1.0
15     15       0.0
16     16       0.0
17     17       1.0
18     18       1.0
19     19       0.0
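
Because sample picks rows at random, the output above changes from run to run. As a minimal sketch of the same approach on the question's 20-row example, the snippet below passes random_state so the pick is repeatable. The seed value 0 and the name ones_idx are arbitrary choices for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the question's example: six 0s, six 1s, eight NaNs.
df = pd.DataFrame({'Column_x': [0.0] * 6 + [1.0] * 6 + [np.nan] * 8})

mask = df['Column_x'].isna()

# Pick half of the NaN rows at random; random_state makes the pick repeatable.
ones_idx = df.loc[mask, 'Column_x'].sample(frac=0.5, random_state=0).index

# The sampled rows get 1, every remaining NaN gets 0.
df.loc[ones_idx, 'Column_x'] = 1.0
df['Column_x'] = df['Column_x'].fillna(0.0)

print(df['Column_x'].value_counts())
```

With an even number of NaN rows, this gives an exact half-and-half split, since sample(frac=0.5) selects a fixed number of rows rather than deciding row by row.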

Answer 2:

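You can use random.choices with its weights parameter to control the proportion of each fill value. The draws are independent, so the split is only approximately the one given by the weights (the mean of 0.507625 in the output below reflects this). I've simulated an all-NaN column with numpy here and computed the exact number of replacements needed. This approach can also be used for columns with more than two classes and more complex distributions.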

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({'col1': range(16000)})
df['col2'] = np.nan

nans = df['col2'].isna()
length = int(nans.sum())
replacement = random.choices([0, 1], weights=[.5, .5], k=length)
df.loc[nans,'col2'] = replacement

print(df.describe())

'''
Out:
               col1          col2
count  16000.000000  16000.000000
mean    7999.500000      0.507625
std     4618.946489      0.499957
min        0.000000      0.000000
25%     3999.750000      0.000000
50%     7999.500000      1.000000
75%    11999.250000      1.000000
max    15999.000000      1.000000
'''
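
If the goal is for the filled values to follow the distribution already observed in the column, the weights can be derived from the data instead of being hard-coded. The following is only a sketch under that assumption: it uses value_counts(normalize=True) for the weights and numpy's Generator.choice in place of random.choices, and the three-class toy column and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (illustrative): three observed classes plus 1000 missing entries.
df = pd.DataFrame(
    {'Column_x': [0.0] * 6 + [1.0] * 3 + [2.0] * 1 + [np.nan] * 1000}
)

nans = df['Column_x'].isna()

# Relative frequency of each observed class; NaN is excluded by default.
freqs = df['Column_x'].value_counts(normalize=True)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# Draw one replacement per missing row, weighted by the observed frequencies.
replacement = rng.choice(
    freqs.index.to_numpy(), size=int(nans.sum()), p=freqs.to_numpy()
)
df.loc[nans, 'Column_x'] = replacement

print(df['Column_x'].value_counts(normalize=True))
```

As with random.choices, each value is drawn independently, so the resulting proportions are close to the observed ones but not guaranteed to match them exactly.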

Answer 3:

Slice the index of the NaN rows and fill each half with a different value.

isnull() detects missing values in the given Series object.

Example:

import pandas as pd

df = pd.DataFrame({'Column_y': pd.Series(range(9), index=['a', 'b', 'c','d','e','f','g','h','i']),
                   'Column_x': pd.Series(range(1), index=['a'])})

print(df)
# get the index labels of the rows where Column_x is NaN
idx = df['Column_x'].index[df['Column_x'].isnull()]
total_nan_len = len(idx)
first_nan = total_nan_len//2
# fill the first 50% of the NaN rows with 1
df.loc[idx[0:first_nan], 'Column_x'] = 1
# fill the remaining 50% of the NaN rows with 0
df.loc[idx[first_nan:total_nan_len], 'Column_x'] = 0
print(df)

Output:

DataFrame before

   Column_y  Column_x
a         0       0.0
b         1       NaN
c         2       NaN
d         3       NaN
e         4       NaN
f         5       NaN
g         6       NaN
h         7       NaN
i         8       NaN

DataFrame after

   Column_y  Column_x
a         0       0.0
b         1       1.0
c         2       1.0
d         3       1.0
e         4       1.0
f         5       0.0
g         6       0.0
h         7       0.0
i         8       0.0
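
The same positional idea can be wrapped in a small helper that handles any number of fill values. This is a rough sketch rather than part of the original answer: fill_nan_in_blocks is a hypothetical name, and it relies on numpy.array_split to cut the NaN positions into near-equal contiguous chunks.

```python
import numpy as np
import pandas as pd


def fill_nan_in_blocks(series, values):
    """Fill NaNs in order, giving each value a contiguous, near-equal share.

    Hypothetical helper for illustration; returns a new Series.
    """
    out = series.copy()
    # Integer positions of the missing entries, in their original order.
    positions = np.flatnonzero(out.isna().to_numpy())
    # One chunk of positions per fill value.
    for chunk, value in zip(np.array_split(positions, len(values)), values):
        out.iloc[chunk] = value
    return out


s = pd.Series([0.0] + [np.nan] * 8, index=list('abcdefghi'))
filled = fill_nan_in_blocks(s, [1, 0])
print(filled)
```

Unlike the random approaches, the result is deterministic: the first block of NaN rows gets the first value, the next block the second, and so on. When the NaN count does not divide evenly, numpy.array_split gives the extra rows to the earlier chunks.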

Source: https://stackoverflow.com/questions/57586275/pandas-fill-nan-using-multiple-values

Tags

python

pandas

dataframe

nan

missing-data