问题
I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})
Cat Vals
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 A NaN
6 B NaN
And I want indexes 5
and 6
to be filled with the conditional mean of 'Vals' based on the 'Cat' column, namely 2
and 4.5
The following code works fine:
means = df.groupby('Cat').Vals.mean()
for i in df[df.Vals.isnull()].index:
df.loc[i, 'Vals'] = means[df.loc[i].Cat]
Cat Vals
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 A 2
6 B 4.5
But I'm looking for something nicer, like
df.Vals.fillna(df.Vals.mean(Conditionally to column 'Cat'))
Edit: I found this, which is one line shorter, but I'm still not happy with it:
means = df.groupby('Cat').Vals.mean()
df.Vals = df.apply(lambda x: means[x.Cat] if pd.isnull(x.Vals) else x.Vals, axis=1)
回答1:
We wish to "associate" the Cat
values with the missing NaN locations.
In Pandas such associations are always done via the index.
So it is natural to set Cat
as the index:
df = df.set_index(['Cat'])
Once this is done, then fillna
works as desired:
df['Vals'] = df['Vals'].fillna(means)
To return Cat
to a column, you could then of course use reset_index
:
df = df.reset_index()
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})
means = df.groupby(['Cat'])['Vals'].mean()
df = df.set_index(['Cat'])
df['Vals'] = df['Vals'].fillna(means)
df = df.reset_index()
print(df)
yields
Cat Vals
0 A 1.0
1 A 2.0
2 A 3.0
3 B 4.0
4 B 5.0
5 A 2.0
6 B 4.5
来源:https://stackoverflow.com/questions/33457191/python-pandas-dataframe-fill-nans-with-a-conditional-mean