Question
For starters, here is some artificial data fitting my problem:
import numpy as np
import pandas as pd

vsize = 100  # example value; the question does not define vsize

df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns=["col_{}".format(x) for x in range(10)],
                  index=range(0, vsize * 3, 3))
df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns=["col_{}".format(x) for x in range(10, 20, 1)],
                    index=range(0, vsize * 2, 2))
df = df.merge(df_2, left_index=True, right_index=True, how='outer')
df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})
df = df.merge(df_tar, left_index=True, right_index=True, how='inner')
Now, I would like to fill the NaN values in each column with the MEDIAN of that column's non-NaN values, with noise added to each filled NaN. The MEDIAN should first be computed over the values in the column that belong to the same class, as marked in column tar_4. Then, if any NaNs persist in the column (because all values in some tar_4 class were NaN, so no MEDIAN could be calculated), the same operation is repeated on the updated column (with some NaNs already filled in from the tar_4 pass), this time grouping by class in the tar_3 column; then tar_2, and finally tar_1.
The way I imagine it would be as follows:
- col_1 features e.g. 6 non-NaN & 4 NaN values: [1, 2, NaN, 4, NaN, 12, 5, NaN, 1, NaN]
- only the values [1, 2, NaN, 4, NaN] belong to the same class (e.g. class 1) in tar_4, so they are pushed through NaN filling:
- the NaN value at index [2] gets filled with MEDIAN (=2) + random(-3, 3) * std error of the distribution in col_1, e.g. 2 + (1 * 1.24) = 3.24
- the NaN value at index [4] gets filled with MEDIAN (=2) + random(-3, 3) * std error, e.g. 2 + (-2 * 1.24) = -0.48
- now col_1 has the following 8 non-NaN and 2 NaN values: [1, 2, 3.24, 4, -0.48, 12, 5, NaN, 1, NaN]
- column col_1 still features some NaN values, so grouping based on common class in the tar_3 column is applied:
- out of [1, 2, 3.24, 4, -0.48, 12, 5, NaN, 1, NaN], the values [1, 2, 3.24, 4, -0.48, 12, 5, NaN] are now in the same class, so they get processed:
- the NaN value at index [7] gets assigned the MEDIAN of the values at indices [0-6] (=3.24) + random(-3, 3) * std error, e.g. 3.24 + (2 * 3.86) = 10.96
- now col_1 has 9 non-NaN values and 1 NaN value: [1, 2, 3.24, 4, -0.48, 12, 5, 10.96, 1, NaN]
- all values in col_1 belong to the same class based on the tar_2 column, so the NaN value at index [9] gets processed with the same logic as described above, ending up as 3.24 + (-2 * 4.05) = -4.86
- col_1 now features only non-NaN values: [1, 2, 3.24, 4, -0.48, 12, 5, 10.96, 1, -4.86], and does not need to be pushed through NaN filling based on the tar_1 column.
The same logic goes through the rest of columns.
So, the expected output is a DataFrame with all NaN values filled, column by column, at decreasing levels of class granularity from tar_4 down to tar_1.
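The walkthrough above can be sketched as a loop over the target columns from finest to coarsest granularity, where each NaN gets its group's median plus its own integer noise draw times the column's standard deviation. This is only an illustrative sketch; the function name `hierarchical_fill` and the tiny demo frame are made up for this example:

```python
import numpy as np
import pandas as pd

def hierarchical_fill(df, value_cols, tar_cols):
    """Fill NaNs per column using group medians at decreasing granularity
    (tar_cols ordered finest to coarsest), adding an independent noise term
    (random integer in [-3, 3] times the column std) to every filled cell."""
    df = df.copy()
    for tar in tar_cols:
        for col in value_cols:
            na_mask = df[col].isna()
            if not na_mask.any():
                continue  # column already complete
            # per-row median of the row's class; stays NaN for all-NaN groups,
            # which defers those cells to the next (coarser) tar column
            group_median = df.groupby(tar)[col].transform('median')
            col_std = df[col].std()
            noise = np.random.randint(-3, 4, size=len(df)) * col_std
            df.loc[na_mask, col] = group_median[na_mask] + noise[na_mask.to_numpy()]
    return df

# illustrative usage on a tiny frame (hypothetical data)
demo = pd.DataFrame({'col_1': [1, 2, np.nan, 4, np.nan, 12, 5, np.nan, 1, np.nan],
                     'tar_4': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                     'tar_3': [0] * 10, 'tar_2': [0] * 10, 'tar_1': [0] * 10})
filled = hierarchical_fill(demo, ['col_1'], ['tar_4', 'tar_3', 'tar_2', 'tar_1'])
```

Because the noise array is drawn per row, every filled NaN receives a different value even within the same class.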
I already have code which almost achieves that, thanks to @Quang Hoang:
def min_max_check(col):
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        return medians[col]
    elif (df[col].dropna() >= 0).all():
        return medians[col] + round(np.random.randint(low=0, high=3) * stds[col], 2)
    else:
        return medians[col] + round(np.random.randint(low=-3, high=3) * stds[col], 2)
tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]
# since your dataframe may not have a continuous index
idx = df.index
for tar in tar_list:
    medians = df[cols].groupby(by=df[tar]).agg('median')
    stds = df[cols].groupby(by=df[tar]).agg('std')  # must be `stds`, as used in min_max_check
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(min_max_check(col))
    df.reset_index(inplace=True)
    df.index = idx
However, this fills every NaN in a column with the same MEDIAN + noise value at each granularity level. How can this code be enhanced to generate a different fill value for each individual NaN at the tar_4, tar_3, tar_2 and tar_1 levels?
Answer 1:
One quick solution is to modify your min_max_check into a gen_noise function that generates a noise value for each row:
def gen_noise(col):
    num_row = len(df)
    # generate noise of the same length as our dataset
    # notice the `size` argument in randint
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        noise = 0
    elif (df[col].dropna() >= 0).all():
        noise = np.random.randint(low=0, high=3, size=num_row)
    else:
        noise = np.random.randint(low=-3, high=3, size=num_row)
    # multiplying by isna() forces the noise at non-null positions of df[col] to 0
    return noise * df[col].isna()
And then later:
df.set_index(tar, inplace=True)
for col in cols:
    noise = gen_noise(col)
    df[col] = (df[col].fillna(medians[col])
                      .add(noise.mul(stds[col]).values))
df.reset_index(inplace=True)
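Put together, a self-contained sketch of this per-row-noise approach might look like the following. The tiny frame is hypothetical, and for brevity the three min/max branches of gen_noise are collapsed into the general signed-noise case:

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # for reproducibility only

df = pd.DataFrame({'col_1': [1, 2, np.nan, 4, np.nan, 12, 5, np.nan, 1, np.nan],
                   'tar_4': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'tar_3': [0] * 10, 'tar_2': [0] * 10, 'tar_1': [0] * 10})

tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [c for c in df.columns if c not in tar_list]
idx = df.index

def gen_noise(col):
    # one integer noise value per row, zeroed where df[col] is not NaN
    noise = np.random.randint(low=-3, high=4, size=len(df))
    return noise * df[col].isna()

for tar in tar_list:
    medians = df[cols].groupby(df[tar]).agg('median')
    stds = df[cols].groupby(df[tar]).agg('std')
    df.set_index(tar, inplace=True)  # fillna/mul now align per class label
    for col in cols:
        noise = gen_noise(col)
        df[col] = (df[col].fillna(medians[col])
                          .add(noise.mul(stds[col]).values))
    df.reset_index(inplace=True)
    df.index = idx
```

Since the noise Series is zero at non-null positions, the `.add(...)` leaves already-present values untouched while each filled NaN gets its own draw.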
Note: You can modify the code further so that you generate a noise_df with the same shape as medians and stds, something like this:
for tar in tar_list:
    medians = df[cols].groupby(df[tar]).agg('median')
    stds = df[cols].groupby(df[tar]).agg('std')
    # generate noise_df here
    medians = medians + round(noise_df * stds, 2)
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(medians[col])
    df.reset_index(inplace=True)
    df.index = idx
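A runnable sketch of that noise_df variant, filling in the "# generate noise_df here" placeholder with a random-integer frame shaped like medians. The demo data is hypothetical, and note that this variant draws one noise value per (class, column) cell rather than per row:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # for reproducibility only

df = pd.DataFrame({'col_1': [1, 2, np.nan, 4, np.nan, 12, 5, np.nan, 1, np.nan],
                   'tar_4': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'tar_3': [0] * 10, 'tar_2': [0] * 10, 'tar_1': [0] * 10})

tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [c for c in df.columns if c not in tar_list]
idx = df.index

for tar in tar_list:
    medians = df[cols].groupby(df[tar]).agg('median')
    stds = df[cols].groupby(df[tar]).agg('std')
    # noise_df: one integer draw in [-3, 3] per (class, column) cell,
    # same shape/labels as medians so the arithmetic aligns
    noise_df = pd.DataFrame(np.random.randint(-3, 4, size=medians.shape),
                            index=medians.index, columns=medians.columns)
    medians = medians + round(noise_df * stds, 2)
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(medians[col])
    df.reset_index(inplace=True)
    df.index = idx
```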
Source: https://stackoverflow.com/questions/56178297/variable-fillna-in-each-column