Pandas.DataFrame interpolate() with method='linear' and 'nearest' returns inconsistent results for trailing NaN

Anonymous (unverified), submitted 2019-12-03 01:35:01

Question:

I was exploring pandas.DataFrame.interpolate() with different methods, linear vs. nearest, and found that the two methods give different outputs when the data ends with missing values.

For example:

    >>> import numpy as np
    >>> import pandas as pd  # version: '0.16.2' or '0.20.3'
    >>> a = pd.DataFrame({'col1': [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]})
    >>> a
    Out[1]:
       col1
    0   NaN
    1   1.0
    2   NaN
    3   3.0
    4   NaN
    5   5.0
    6   NaN

    >>> a.interpolate(method='linear')
    Out[2]:
       col1
    0   NaN
    1   1.0
    2   2.0
    3   3.0
    4   4.0
    5   5.0
    6   5.0

    >>> a.interpolate(method='nearest')
    Out[3]:
       col1
    0   NaN
    1   1.0
    2   1.0
    3   3.0
    4   3.0
    5   5.0
    6   NaN

It seems that the linear method extrapolates the trailing NaN while the 'nearest' method does not, unless you specify fill_value='extrapolate':

    >>> a.interpolate(method='nearest', fill_value='extrapolate')
    Out[4]:
       col1
    0   NaN
    1   1.0
    2   1.0
    3   3.0
    4   3.0
    5   5.0
    6   5.0

So my question is: why do the two methods handle trailing NaN differently? Is this the intended behavior, or is it a bug?

The same results were found with two versions of pandas, '0.16.2' and '0.20.3'.

pandas.Series.interpolate() also shows the same issue.

There is a thread and a GitHub issue discussing a similar problem, but with a different purpose. I am looking for an explanation of, or a conclusion on, this issue.

EDIT:

Correction: what the linear method does is not exactly extrapolation, since the filled value of the last row is 5 rather than 6. It looks more like a bug now. Is it?

Answer 1:

By default, df.interpolate(method='linear') forward-fills NaNs after the last valid value. That is rather surprising given that the method name only mentions "interpolate".

To restrict df.interpolate to only interpolate NaNs between valid (non-NaN) values, as of Pandas version 0.23.0 (Reference), use limit_area='inside'.

    import pandas as pd
    import numpy as np

    a = pd.DataFrame({'col1': [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]})
    a['linear'] = a.interpolate(method='linear')['col1']
    a['linear inside'] = a.interpolate(method='linear', limit_area='inside')['col1']
    print(a)

yields

       col1  linear  linear inside
    0   NaN     NaN            NaN
    1   1.0     1.0            1.0
    2   NaN     2.0            2.0
    3   3.0     3.0            3.0
    4   NaN     4.0            4.0
    5   5.0     5.0            5.0
    6   NaN     5.0            NaN
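
The behavior described in this answer can also be checked programmatically rather than by reading the printed frame. The snippet below is a minimal sketch, assuming pandas 0.23.0 or later (for limit_area) and a Series with two trailing NaNs instead of one; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same pattern as the question, with one extra trailing NaN (an assumption
# added here to show that every trailing gap gets the same value).
s = pd.Series([np.nan, 1, np.nan, 3, np.nan, 5, np.nan, np.nan])

linear = s.interpolate(method='linear')
inside = s.interpolate(method='linear', limit_area='inside')

# With the defaults, trailing NaNs repeat the last valid value (5.0)
# rather than continuing the slope to 6.0, 7.0.
trailing = linear.iloc[-2:].tolist()

# The leading NaN is left untouched by the default forward direction.
leading_is_nan = bool(np.isnan(linear.iloc[0]))

# limit_area='inside' leaves the trailing NaNs as NaN.
inside_trailing_nan = bool(inside.iloc[-2:].isna().all())

print(trailing, leading_is_nan, inside_trailing_nan)
```

If the trailing values come back as the last valid value, that matches the forward-fill description; a true extrapolation would have continued the line.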


Answer 2:

@D.Weis, this is a great question. As far as I can tell, there is no thread or GitHub issue that covers it, so let me explain it in depth, step by step.

    >>> a = pd.DataFrame({'col1': [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]})
    >>> a
    Out[1]:
       col1
    0   NaN
    1   1.0
    2   NaN
    3   3.0
    4   NaN
    5   5.0
    6   NaN

1.) Interpolation by 'linear'

In 'linear' interpolation, a missing value is filled from the two nearest valid values, one on each side of it. In 'nearest' interpolation, a missing value is instead given the same value as the nearest valid position. 'nearest' interpolation is explained in more detail in section (2).

Example for 'linear' interpolation:

    position   before   after
    1          1.0      1.0
    2          NaN      2.0
    3          3.0      3.0
    4          NaN      4.0

Here, position 2 is empty. To fill it, 'linear' interpolation takes the values at positions 1 and 3, which are 1.0 and 3.0 respectively. Remember that 'linear' interpolation uses only the two surrounding values to fill a missing value.

(1.0 + 3.0) / 2 = 2.0, which is the value for position 2. The other interior gaps are filled the same way.
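
The arithmetic above can be reproduced with a small worked example. This is only a sketch using numpy.interp as a stand-in for linear interpolation between known points; it is not claimed to be how pandas computes the result, and the variable names are made up for illustration.

```python
import numpy as np

# Known points around the gap: positions 1 and 3 hold 1.0 and 3.0.
x_known = [1, 3]
y_known = [1.0, 3.0]

# Midpoint formula from the text: valid because position 2 sits exactly
# halfway between positions 1 and 3.
midpoint = (y_known[0] + y_known[1]) / 2

# General linear interpolation evaluated at position 2.
interp_value = float(np.interp(2, x_known, y_known))

# Outside the known range, np.interp returns the edge value instead of
# extending the line.
beyond_end = float(np.interp(6, [1, 3, 5], [1.0, 3.0, 5.0]))

print(midpoint, interp_value, beyond_end)
```

Note that numpy.interp holds the last known value beyond the final point, which is consistent with the 5.0 (rather than 6.0) that the question reports for the trailing row.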

2.) Interpolate by 'nearest'

    >>> a.interpolate(method='nearest')
    Out[3]:
       col1
    0   NaN
    1   1.0
    2   1.0
    3   3.0
    4   3.0
    5   5.0
    6   NaN

Basically, 'nearest' interpolation fills each missing value with a copy of the nearest valid value. For instance,

    position   before   after
    1          1.0      1.0
    2          NaN      1.0
    3          3.0      3.0
    4          NaN      3.0

In the example above, position 2 takes the value of position 1. Positions 1 and 3 are equally close to position 2, and in this output the earlier one is used. In short, 'nearest' interpolation copies the value of the nearest surrounding position instead of computing a new one.

With method='nearest', fill_value='extrapolate', as in your example, the last row is filled with the value of position 5. The idea is the same as for the interior missing values explained above.
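
The rule described in this section can be sketched in plain NumPy. The helper below, fill_nearest, and its fill_trailing flag are hypothetical names written for illustration; this is not the pandas or SciPy implementation. It assumes ties go to the earlier position and that leading NaNs are always left alone, which matches the outputs shown in the question.

```python
import numpy as np

def fill_nearest(values, fill_trailing=False):
    """Copy the nearest valid value into each interior NaN.

    Hypothetical helper: trailing NaNs are filled only when fill_trailing
    is True; leading NaNs are never filled.
    """
    values = np.asarray(values, dtype=float)
    valid_idx = np.flatnonzero(~np.isnan(values))
    out = values.copy()
    if valid_idx.size == 0:
        return out
    for i in np.flatnonzero(np.isnan(values)):
        if i < valid_idx[0]:
            continue  # leading NaN: leave as is
        if i > valid_idx[-1] and not fill_trailing:
            continue  # trailing NaN: leave as is unless asked to fill
        # argmin returns the first minimum, so ties go to the earlier position.
        nearest = valid_idx[np.argmin(np.abs(valid_idx - i))]
        out[i] = values[nearest]
    return out

data = [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]
interior_only = fill_nearest(data)
with_trailing = fill_nearest(data, fill_trailing=True)
print(interior_only)
print(with_trailing)
```

The first call leaves the last row as NaN, like Out[3]; the second fills it with the value of position 5, like Out[4].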

NOTE: There are other interpolation methods as well, such as 'quadratic', 'cubic', and 'spline'. They differ in how accurately they fill the missing values.

My suggestion, if you are choosing between 'nearest' and 'linear' interpolation, is to go with 'linear', because it will generally fill in values more accurately than 'nearest'.

Hopefully, this will help you. Good luck!


