Is it possible in pandas to interpolate missing values in a MultiIndex DataFrame? The example below does not work as expected:
arr1=np.array(np.arange(1.,
So before filling the missing values, this is what you have in the first few rows:
df2
     xplusy  xtimesy
x y
1 2     3.0      2.0
2 2     NaN      NaN
  4     6.0      8.0
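For reference, the frame above can be reconstructed directly (a minimal sketch; the values and the column/index names are read off the printed output):

```python
import numpy as np
import pandas as pd

# A frame indexed by (x, y) where the middle row has missing values,
# matching the printed df2 above.
df2 = pd.DataFrame(
    {'xplusy': [3.0, np.nan, 6.0], 'xtimesy': [2.0, np.nan, 8.0]},
    index=pd.MultiIndex.from_tuples([(1, 2), (2, 2), (2, 4)],
                                    names=['x', 'y']))
print(df2)
```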
It looks like you want to interpolate based on the MultiIndex. I don't believe there is any way to do that with pandas' interpolate, but you can do it based on a simple index. (By the way, method='linear' ignores the index and is also the default, so there's no need to specify it.)
df2.reset_index(level=1).interpolate(method='index')
   y  xplusy  xtimesy
x
1  2     3.0      2.0
2  2     6.0      8.0
2  4     6.0      8.0
df2.reset_index(level=0).interpolate(method='index')
   x  xplusy  xtimesy
y
2  1     3.0      2.0
2  2     3.0      2.0
4  2     6.0      8.0
Obviously, in this case you could recreate xplusy and xtimesy in multiple steps (first x, then y, then xplusy and xtimesy), but I'm not sure if that's what you're really trying to do.
Anyway, this is the kind of 1-D interpolation you can do pretty easily with pandas' interpolate. If that's not enough, you could look into scipy's 2-D interpolation routines for starters.
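To make the 2-D route concrete, here is a minimal sketch with scipy's griddata (the points and values are made up for illustration; note that interp2d has been removed from recent SciPy releases, so griddata is the safer starting point):

```python
import numpy as np
from scipy.interpolate import griddata

# Known (x, y) points and their values, then an estimate at a new
# point inside their convex hull.
points = np.array([(1.0, 2.0), (2.0, 4.0), (3.0, 2.0), (2.0, 2.0)])
values = points[:, 0] + points[:, 1]  # e.g. an 'xplusy'-style column
est = griddata(points, values, np.array([(2.0, 3.0)]), method='linear')
# x + y is affine, so linear interpolation recovers it: est ~ [5.0]
```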
There are different ways depending on how many rows you have.
I used to deal with a dataset of 70 million rows on my Mac Pro (16 GB RAM). I had to group rows by product_id, client_id and week number to calculate customer demand. Like your example, this dataset did not have every product in every week. So I tried these approaches:
Find the missing week numbers for every product, fill them in, and reindex. This took too much time and memory to return a result, even when I split the dataset into several pieces.
Find the missing week numbers for every product, build a new dataframe, and concat it with the original dataframe. More efficient, but it still took too much time (several hours) and memory.
In the end, I found this post on Stack Overflow. I unstacked the week number, filled the empty weeks with -9999 (a value that doesn't occur in the data), and stacked it again. After that, I replaced -9999 with np.nan and got what I wanted. It took only a few minutes. I think that's the right way to do it.
In conclusion, if you have limited resources, reindex is only practical on a small dataset (I used the first approach on a piece with 5 million rows, and it returned in minutes), while unstack/stack should work on bigger dataframes.
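A toy sketch of the unstack/fill/stack trick described above (the names product_id, week and demand are illustrative, not from an actual dataset):

```python
import numpy as np
import pandas as pd

# Products observed only in some weeks; the (2, 3) combination is
# missing entirely from the data.
df = pd.DataFrame({'product_id': [1, 1, 2],
                   'week': [1, 3, 1],
                   'demand': [10, 30, 5]}).set_index(['product_id', 'week'])
# unstack materialises every missing (product, week) pair as NaN;
# fill with a sentinel so stack keeps those rows, then restore NaN.
full = df.unstack('week').fillna(-9999).stack('week')
full = full.replace(-9999, np.nan)
print(full)
```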
from scipy.interpolate import griddata
import numpy as np
import pandas as pd

def multireindex(_df, new_multi_index, method='linear'):
    # Interpolate every column at the new index tuples with scipy's
    # griddata, then append those rows to the original frame.
    _points = np.array(_df.index.values.tolist())
    dfn = dict()
    for aclm in _df.columns:
        dfn[aclm] = griddata(_points, _df[aclm],
                             np.array(new_multi_index), method=method)
    dfn = pd.DataFrame(dfn, index=pd.MultiIndex.from_tuples(
        new_multi_index, names=_df.index.names))
    return pd.concat([dfn, _df])
import pandas as pd
import numpy as np
from scipy.interpolate import griddata

arr1 = np.random.rand(100)
arr2 = np.random.rand(100)
arr1, arr2 = [np.round(a * b) for a, b in zip([arr1, arr2], [100, 100])]
df1 = pd.DataFrame(list(zip(arr1, arr2, arr1 + arr2, arr1 * arr2)),
                   columns=['x', 'y', 'plus', 'times'])
df1.set_index(['x', 'y'], inplace=True)

new_points = [(20.0, 20.0), (25.0, 25.0)]
df2 = multireindex(df1, new_points)
df2.head()