I was exploring pandas.DataFrame.interpolate() with different methods, linear vs. nearest, and I found that the two methods give different outputs when the data ends with missing values.
For example:

    import numpy as np
    import pandas as pd  # version: '0.16.2' or '0.20.3'

    >>> a = pd.DataFrame({'col1': [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]})
    >>> a
    Out[1]:
       col1
    0   NaN
    1   1.0
    2   NaN
    3   3.0
    4   NaN
    5   5.0
    6   NaN

    >>> a.interpolate(method='linear')
    Out[2]:
       col1
    0   NaN
    1   1.0
    2   2.0
    3   3.0
    4   4.0
    5   5.0
    6   5.0

    >>> a.interpolate(method='nearest')
    Out[3]:
       col1
    0   NaN
    1   1.0
    2   1.0
    3   3.0
    4   3.0
    5   5.0
    6   NaN

It seems that the linear method extrapolates the trailing NaN while the nearest method does not, unless you specify fill_value='extrapolate':

    >>> a.interpolate(method='nearest', fill_value='extrapolate')
    Out[4]:
       col1
    0   NaN
    1   1.0
    2   1.0
    3   3.0
    4   3.0
    5   5.0
    6   5.0

So my question is: why do the two methods handle trailing NaN differently? Is this the intended behavior, or is it a bug?
The same results were found with two versions of pandas, '0.16.2' and '0.20.3'.
pandas.Series.interpolate() also shows the same issue.
There is a thread and a GitHub issue that discuss a similar problem but with a different purpose. I am looking for an explanation of, or a conclusion on, this issue.
EDIT:
Correction: the way the linear method behaves is not exactly extrapolation, as you can see from the last row, where the filled value is 5 instead of 6. It looks more like a bug now. Is it?
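
To make the difference concrete, here is a minimal sketch that compares what method='linear' puts in the last row with what a straight-line continuation of the last two valid points would give. The slope calculation is only an illustration written for this comparison; it is not something pandas does internally.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 3, np.nan, 5, np.nan])

# What pandas puts in the trailing NaN with the default settings.
filled = s.interpolate(method='linear')
pandas_last = filled.iloc[-1]

# What a true linear extrapolation would give: continue the line
# through the last two valid points (positions 3 and 5).
valid = s.dropna()
x0, x1 = valid.index[-2], valid.index[-1]
y0, y1 = valid.iloc[-2], valid.iloc[-1]
slope = (y1 - y0) / (x1 - x0)
extrapolated_last = y1 + slope * (s.index[-1] - x1)

print(pandas_last, extrapolated_last)
```

The first number is the last valid value repeated, the second is the value on the extended line, which is why the behaviour looks like a forward fill rather than an extrapolation.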
By default, df.interpolate(method='linear') forward-fills NaNs after the last valid value. That is rather surprising given that the method name only mentions "interpolate".
To restrict df.interpolate to only interpolate NaNs between valid (non-NaN) values, as of Pandas version 0.23.0 (Reference), use limit_area='inside'.

    import pandas as pd
    import numpy as np

    a = pd.DataFrame({'col1': [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]})
    a['linear'] = a.interpolate(method='linear')['col1']
    a['linear inside'] = a.interpolate(method='linear', limit_area='inside')['col1']
    print(a)

yields

       col1  linear  linear inside
    0   NaN     NaN            NaN
    1   1.0     1.0            1.0
    2   NaN     2.0            2.0
    3   3.0     3.0            3.0
    4   NaN     4.0            4.0
    5   5.0     5.0            NaN
    6   NaN     5.0            NaN

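
For pandas versions before 0.23.0, where limit_area is not available, one possible workaround is to interpolate first and then mask out everything after the last valid value. The sketch below assumes that Series.bfill and Series.where behave as they do in current pandas; it has not been checked against 0.16.2 or 0.20.3.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 3, np.nan, 5, np.nan])

# Default linear interpolation: the trailing NaN becomes 5.0.
filled = s.interpolate(method='linear')

# A back-fill leaves NaN only after the last valid value, so its
# notna() mask is False exactly on the trailing NaNs.
not_trailing = s.bfill().notna()

# Keep interpolated values everywhere except the trailing positions.
inside_only = filled.where(not_trailing)
print(inside_only)
```

The leading NaN stays NaN because the default linear interpolation does not fill it in the first place; the mask only undoes the trailing fill.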
@D.Weis It is a great question, and there is no thread or GitHub issue that answers it. Let me explain in depth, step by step.

    >>> a = pd.DataFrame({'col1': [np.nan, 1, np.nan, 3, np.nan, 5, np.nan]})
    >>> a
    Out[1]:
       col1
    0   NaN
    1   1.0
    2   NaN
    3   3.0
    4   NaN
    5   5.0
    6   NaN

1.) Interpolation by 'linear'
In 'linear' interpolation, a missing value is computed from the two nearest valid values, one on each side of it. In 'nearest' interpolation, a missing value is instead copied unchanged from the nearest valid value. I have explained 'nearest' interpolation in more depth in section (2).
Example for 'linear' interpolation:

    before       after
    1  1.0       1  1.0
    2  NaN       2  2.0
    3  3.0       3  3.0
    4  NaN       4  4.0

Here, the 2nd position is empty. To fill it, the values at the 1st and 3rd positions are used, which are 1.0 and 3.0 respectively. Remember that 'linear' interpolation uses just the 2 surrounding valid values to fill a missing value.
(1.0 + 3.0) / 2 = 2.0 is the answer for the 2nd position. The other inner values are filled in the same way.
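
The same arithmetic can be written as a small sketch. The weighting formula for a gap of several NaNs is the usual equally spaced linear interpolation, stated here as an illustration rather than as pandas' actual implementation; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# One missing value between 1.0 and 3.0: the midpoint.
left, right = 1.0, 3.0
midpoint = (left + right) / 2

# A gap of n missing values: the k-th one (k = 1..n) sits at
# left + (right - left) * k / (n + 1), treating rows as equally spaced.
gap_left, gap_right, n = 1.0, 4.0, 2
manual_gap = [gap_left + (gap_right - gap_left) * k / (n + 1) for k in range(1, n + 1)]

# pandas agrees for values that lie between two valid neighbours.
single = pd.Series([1.0, np.nan, 3.0]).interpolate(method='linear')
double = pd.Series([1.0, np.nan, np.nan, 4.0]).interpolate(method='linear')

print(midpoint, manual_gap)
print(single.tolist(), double.tolist())
```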
2.) Interpolate by 'nearest'

    >>> a.interpolate(method='nearest')
    Out[3]:
       col1
    0   NaN
    1   1.0
    2   1.0
    3   3.0
    4   3.0
    5   5.0
    6   NaN

Basically, 'nearest' interpolation fills a missing value with a copy of the nearest valid value. For instance,

    before       after
    1  1.0       1  1.0
    2  NaN       2  1.0
    3  3.0       3  3.0
    4  NaN       4  3.0

In the above example, position 2 takes the same value as position 1. Position 2 is equally far from positions 1 and 3, and in this output the tie goes to the earlier one. In short, just keep in mind that 'nearest' interpolation fills a missing value with an unchanged copy of a neighbouring value instead of computing a new one.
With method='nearest', fill_value='extrapolate', you can see in your example that the last value is filled with the same value as the 5th position. The concept is the same as for filling the inner missing values explained above.
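
The idea can be sketched in plain Python. The helper nearest_fill and its fill_trailing flag are hypothetical names made up for this illustration; this is not how pandas or SciPy implement method='nearest', only a small model that reproduces the outputs shown in the question.

```python
import math

def nearest_fill(values, fill_trailing=False):
    """Hypothetical helper: copy the nearest valid value into each NaN.

    Leading NaNs are always left alone; trailing NaNs are filled only
    when fill_trailing is True.
    """
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    out = list(values)
    if not valid:
        return out
    first, last = valid[0], valid[-1]
    for i, v in enumerate(values):
        if not math.isnan(v) or i < first:
            continue
        if i > last and not fill_trailing:
            continue
        # min() returns the first minimum, so a tie goes to the earlier position.
        nearest = min(valid, key=lambda k: abs(k - i))
        out[i] = values[nearest]
    return out

nan = float('nan')
col1 = [nan, 1.0, nan, 3.0, nan, 5.0, nan]

inside = nearest_fill(col1)                       # like Out[3]
extended = nearest_fill(col1, fill_trailing=True)  # like Out[4]
print(inside)
print(extended)
```

With fill_trailing=True the last row receives 5.0, the value at position 5, which mirrors what fill_value='extrapolate' produced in the question.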
NOTE: Moreover, there are other interpolation methods, such as 'quadratic', 'cubic', etc. They differ in how the missing values are estimated from the surrounding data.
My suggestion, if you have to choose between 'nearest' and 'linear' interpolation, is to go with 'linear', because for data that changes gradually it usually fills in values closer to the real ones than 'nearest' does.
Hopefully, this will help you. Good luck!