Question
I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})
# desired output
a b
1 1
1 1
2 2
2 2
2 2
Here are the three solutions that I've tried so far.
# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')
# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')
All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?
Answer 1:
You need to sort by both columns, df.sort_values(['a', 'b']).ffill(), to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill that with a value from the prior group. Because np.nan will be placed at the end of any sort, sorting by both a and b ensures that you will not have np.nan at the front of any group. You can then .loc or .reindex with the initial index to get back your original order.
This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.
demo
Consider the dataframe df
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})
print(df)
a b
0 1 1.0
1 1 NaN
2 2 NaN
3 2 2.0
4 2 NaN
Try
df.sort_values('a').ffill()
a b
0 1 1.0
1 1 1.0
2 2 1.0 # <--- this is incorrect
3 2 2.0
4 2 2.0
Instead do
df.sort_values(['a', 'b']).ffill().loc[df.index]
a b
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 2 2.0
special note
This is still incorrect if an entire group has missing values: every row of that group is then filled from the preceding group.
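To make the caveat concrete, here is a small sketch on made-up data (not from the question) in which group 2 has no observed value at all. It compares the sort-then-ffill approach with a group-aware fill via groupby(...).ffill(), which is one possible alternative rather than the method proposed above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 2 contains only missing values.
df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1.0, np.nan, np.nan, np.nan]})

# Sort-then-ffill: the value 1.0 from group 1 leaks into group 2.
leaky = df.sort_values(['a', 'b']).ffill().loc[df.index]
print(leaky)

# Group-aware fill: group 2 stays NaN because it has nothing to fill from.
safe = df.groupby('a')['b'].ffill()
print(safe)
```

Whether an all-missing group should stay NaN or be dropped depends on the data; the point is only that the fill never crosses a group boundary.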
Answer 2:
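Using ffill() directly, without grouping, is the fastest option. It gives the correct result only because the first value in each group is never missing, as the question states. Here is the comparison: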
%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop
%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop
%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop
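Before relying on the ungrouped fill, it may be worth checking that it agrees with a grouped fill on your data. The sketch below does that on the question's dataframe and then on the demo dataframe from the first answer, where a group starts with NaN. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Data from the question: every group starts with an observed value.
df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})
plain = df['b'].ffill()                  # ignores the groups entirely
grouped = df.groupby('a')['b'].ffill()   # fills within each group only
print(plain.equals(grouped))             # the two agree on this data

# Demo data from the first answer: group 2 starts with NaN.
df2 = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})
plain2 = df2['b'].ffill()
grouped2 = df2.groupby('a')['b'].ffill()
print(plain2.equals(grouped2))           # the two differ on this data
```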
Answer 3:
What about this?
df.groupby('a').b.transform('ffill')
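A minimal sketch of how that expression could be used on the question's dataframe. It assumes you want to overwrite column b in place of the original values; the bracket form ['b'] is used instead of .b, which selects the same column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})

# transform returns a Series aligned to df's index, so it can be assigned back directly.
df['b'] = df.groupby('a')['b'].transform('ffill')
print(df)
```

Passing the string 'ffill' avoids calling a Python lambda once per group, which is the pattern used in the slow solutions from the question.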
Source: https://stackoverflow.com/questions/43075747/efficient-solution-for-forward-filling-missing-values-in-a-pandas-dataframe-colu