Efficient solution for forward filling missing values in a pandas dataframe column?

孤街醉人 submitted on 2019-12-06 15:48:06

To make this robust, sort by both columns: df.sort_values(['a', 'b']).ffill(). If an np.nan sits in the first position of a group, ffill fills it with a value from the preceding group. Because sort_values places np.nan last by default, sorting by both a and b ensures that no group starts with np.nan, as long as the group has at least one non-missing value. You can then use .loc or .reindex with the original index to restore the original order.

This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.

Demo

Consider the dataframe df:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})

print(df)

   a    b
0  1  1.0
1  1  NaN
2  2  NaN
3  2  2.0
4  2  NaN

Try

df.sort_values('a').ffill()

   a    b
0  1  1.0
1  1  1.0
2  2  1.0  # <--- this is incorrect
3  2  2.0
4  2  2.0

Instead do

df.sort_values(['a', 'b']).ffill().loc[df.index]

   a    b
0  1  1.0
1  1  1.0
2  2  2.0
3  2  2.0
4  2  2.0

Special note
This is still incorrect if every value in a group is missing: the rows of that group are then filled from the preceding group.
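
The caveat can be reproduced with a small sketch. The frame df2 below is made-up illustration data (not from the original question) in which group a == 2 has no observed value at all. It compares the sort-based approach with a group-wise fill via groupby(...).ffill(), which is offered here only as one possible alternative and never crosses a group boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical data: every value of b in group a == 2 is missing.
df2 = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1.0, np.nan, np.nan, np.nan]})

# Sort-based approach: the 1.0 from group 1 leaks into group 2.
leaky = df2.sort_values(['a', 'b']).ffill().loc[df2.index]

# Group-wise forward fill: group 2 has nothing to fill from, so it stays NaN.
grouped = df2.groupby('a')['b'].ffill()

print(leaky)
print(grouped)
```

Whether an all-missing group should stay NaN depends on the use case; the point is only that the sort-based version silently borrows a value from another group.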

Using ffill() directly is the fastest option, although it fills across the whole column and ignores the groups in a. Here is the timing comparison:

%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop

%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop

%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop

What about this?

df.groupby('a').b.transform('ffill')
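
As a quick check of what this one-liner returns, here is a minimal sketch that applies it to the demo dataframe from the first answer. The variable name filled is only for illustration. Note that the fill follows the existing row order within each group, so a missing value at the start of a group has nothing to fill from.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, np.nan, 2, np.nan]})

# Forward fill b within each group of a, keeping the original index.
filled = df.groupby('a').b.transform('ffill')

# Row 2 is the first row of group 2, so it remains NaN;
# rows 1 and 4 are filled from the previous row of their own group.
print(filled)
```

This differs from the sort-based result above, where row 2 receives 2.0 because sorting by b moves the observed value ahead of the missing ones.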