pandas multiindex dataframe, ND interpolation for missing values

逝去的感伤 2021-01-16 21:00

Is it possible in pandas to interpolate missing values in a MultiIndex DataFrame? The example below does not work as expected:

arr1=np.array(np.arange(1.,         


        
3 answers
  • 2021-01-16 21:05

    So before filling the missing values, this is what you have in the first few rows:

    df2
    
          xplusy  xtimesy
    x y                  
    1 2        3        2
    2 2      NaN      NaN
      4        6        8
    

    It looks like you want to interpolate based on the MultiIndex. I don't believe there is any way to do that with pandas interpolate, but you can interpolate against a single index level. (By the way, method='linear' ignores the index values and is also the default, so there is no need to specify it.)

    df2.reset_index(level=1).interpolate(method='index')
    
        y  xplusy  xtimesy
    x                     
    1   2       3        2
    2   2       6        8
    2   4       6        8
    
    df2.reset_index(level=0).interpolate(method='index')
    
        x  xplusy  xtimesy
    y                     
    2   1     3.0        2
    2   2     3.0        2
    4   2     6.0        8
    

    Obviously, in this case you could build xplusy and xtimesy in multiple steps (first x, then y, then xplusy and xtimesy), but I'm not sure that's what you are really trying to do.

    Anyway, this is the kind of 1-D interpolation you can do pretty easily with pandas interpolate. If that's not enough, you could look into SciPy's interp2d (in scipy.interpolate, not NumPy) for starters.
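To make the difference between method='linear' and method='index' concrete, here is a minimal sketch on made-up data (the values and the column name `val` are only illustrative, not from the question). It shows that 'linear' treats rows as equally spaced while 'index' weights by the index values, and then applies the same idea after moving one MultiIndex level out of the index.

```python
import numpy as np
import pandas as pd

# Hypothetical series with unevenly spaced index values and one gap.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])
by_position = s.interpolate(method="linear")  # rows treated as equally spaced
by_index = s.interpolate(method="index")      # weighted by the index values

# Same idea on a MultiIndex frame: move x into a column so that y
# becomes a plain numeric index, then interpolate against y.
idx = pd.MultiIndex.from_tuples([(1, 2), (2, 4), (3, 8)], names=["x", "y"])
df = pd.DataFrame({"val": [2.0, np.nan, 8.0]}, index=idx)
along_y = df.reset_index(level="x").interpolate(method="index")

# Restore the original (x, y) MultiIndex afterwards.
restored = along_y.set_index("x", append=True).reorder_levels(["x", "y"])

print(by_position.tolist())
print(by_index.tolist())
print(restored)
```

With 'linear' the gap is filled with the midpoint of its neighbours; with 'index' it is filled in proportion to how far the index value lies between the neighbouring index values.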

  • 2021-01-16 21:06

    There are different ways, depending on how many rows you have.

    I once dealt with a dataset of 70 million rows on my Mac Pro (16 GB RAM). I had to group rows by product_id, client_id and week number to calculate customer demand. As in your example, the dataset did not contain every product for every week. I tried these approaches:

    1. Find the missing week numbers for every product, fill them in and reindex. This took too much time and memory to return a result, even when I split the dataset into several pieces.

    2. Find the missing week numbers for every product, build a new dataframe, and concat it with the original dataframe. More efficient, but it still used too much time (several hours) and memory.

    3. Finally, I found this post on Stack Overflow. I unstacked the week number, used fillna with -9999 (a value that does not occur in the data) for the empty weeks, and stacked it again. After that I replaced -9999 with np.nan and got what I wanted. It took only several minutes. I think it's the right way to do it.

    In conclusion, if you have limited resources, "reindex" is only practical on a small dataset (I used the first approach on a piece with 5 million rows and it returned in minutes), whereas "unstack/stack" also works on bigger dataframes.
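The unstack/stack approach in point 3 can be sketched roughly as follows. The tiny `sales` frame, the `demand` column and the `SENTINEL` name are hypothetical stand-ins for the real dataset, which is not shown in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: product 1 is missing week 2, product 2 is missing weeks 1 and 3.
sales = pd.DataFrame(
    {"product_id": [1, 1, 2], "week": [1, 3, 2], "demand": [5.0, 7.0, 4.0]}
).set_index(["product_id", "week"])

SENTINEL = -9999  # assumed not to occur in the real data

# Weeks become columns, so every (product, week) pair gets a cell; gaps are NaN.
wide = sales["demand"].unstack("week")

# Fill the gaps with the sentinel so stack() keeps those cells,
# then turn the sentinel back into NaN.
full = wide.fillna(SENTINEL).stack().replace(SENTINEL, np.nan)

print(full)
```

The result has one row per (product_id, week) combination, with NaN where the original data had no row, so it can then be passed to interpolate or fillna as needed.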

  • 2021-01-16 21:22
    import numpy as np
    import pandas as pd
    from scipy.interpolate import griddata

    def multireindex(_df, new_multi_index, method='linear', copy=True):
        # Interpolate every column at the new index points with griddata
        # and return the new rows followed by the original ones.
        # Note: `copy` is accepted but currently unused.
        _points = np.array(_df.index.values.tolist())
        dfn = dict()
        for aclm in _df.columns:
            dfn[aclm] = griddata(_points, _df[aclm],
                                 np.array(new_multi_index), method=method)
        dfn = pd.DataFrame(dfn, index=pd.MultiIndex.from_tuples(
            new_multi_index, names=_df.index.names))
        return pd.concat([dfn, _df])

    # Random sample data: x and y are rounded values in [0, 100].
    arr1 = np.round(np.random.rand(100) * 100)
    arr2 = np.round(np.random.rand(100) * 100)
    df1 = pd.DataFrame({'x': arr1, 'y': arr2,
                        'plus': arr1 + arr2, 'times': arr1 * arr2})
    df1.set_index(['x', 'y'], inplace=True)

    # Interpolate 'plus' and 'times' at two new (x, y) points.
    new_points = [(20.0, 20.0), (25.0, 25.0)]
    df2 = multireindex(df1, new_points)
    df2.head()
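The same griddata idea can also fill rows that already exist in the MultiIndex but hold NaN, which is closer to the original question. This is only a sketch under assumptions: the 3x3 grid, the `xplusy` column and the variable names are made up for illustration, and linear griddata returns NaN for points outside the convex hull of the known points.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

# Hypothetical frame on a 3x3 (x, y) grid with one missing value.
idx = pd.MultiIndex.from_product(
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]], names=["x", "y"]
)
xs = idx.get_level_values("x").to_numpy()
ys = idx.get_level_values("y").to_numpy()
df = pd.DataFrame({"xplusy": xs + ys}, index=idx)
df.loc[(2.0, 4.0), "xplusy"] = np.nan

# Use the index levels as 2-D coordinates.
coords = np.column_stack([xs, ys])
known = df["xplusy"].notna().to_numpy()

# Interpolate the missing rows from the known rows.
filled = df.copy()
filled.loc[~known, "xplusy"] = griddata(
    coords[known],
    df["xplusy"].to_numpy()[known],
    coords[~known],
    method="linear",
)

print(filled)
```

Because x + y is linear in both coordinates, the interpolated value at (2, 4) comes out at about 6 here; real data will generally not be reproduced this exactly.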
    