pandas MultiIndex DataFrame: N-D interpolation for missing values


Is it possible in pandas to interpolate missing values in a MultiIndex DataFrame? The example below does not work as expected:

arr1=np.array(np.arange(1.,


                      


      
      
        
3 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  闹比i        
                
              
                            
                2021-01-16 21:05
              
            
            
                                                                       
So before filling the missing values, this is what you have in the first few rows:

df2

      xplusy  xtimesy
x y                  
1 2        3        2
2 2      NaN      NaN
  4        6        8


It looks like you want to interpolate based on the MultiIndex. I don't believe pandas interpolate can do that, but you can interpolate against a single index level (note that method='linear' ignores the index and is also the default, so there is no need to specify it):

df2.reset_index(level=1).interpolate(method='index')

    y  xplusy  xtimesy
x                     
1   2       3        2
2   2       6        8
2   4       6        8

df2.reset_index(level=0).interpolate(method='index')

    x  xplusy  xtimesy
y                     
2   1     3.0        2
2   2     3.0        2
4   2     6.0        8
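
To make the difference between method='linear' and method='index' concrete, here is a minimal sketch on a made-up Series (the values and index are illustrative only, not taken from the question). With an unevenly spaced index the two methods give different results:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value, unevenly spaced index (0, 1, 10).
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' (the default) treats rows as equally spaced and ignores the index.
by_position = s.interpolate()

# 'index' weights the interpolation by the numeric index values.
by_index = s.interpolate(method="index")

print(by_position.loc[1])  # halfway between 0 and 10
print(by_index.loc[1])     # one tenth of the way from 0 to 10
```

This is why the index level you leave in place after reset_index matters: it becomes the coordinate that method='index' interpolates along.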


Obviously in this case you could create xplusy and xtimesy in multiple steps (first x, then y, then xplusy and xtimesy) but I'm not sure if that's what you are really trying to do.

Anyway, this is the kind of 1-D interpolation you can do pretty easily with pandas interpolate. If that's not enough, you could look into SciPy's interp2d (it lives in scipy.interpolate, not NumPy) for starters.
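
If 1-D interpolation is not enough, one possible direction is scattered-data interpolation with scipy.interpolate.griddata, treating the index tuples as coordinates. The sketch below is only an illustration under assumptions: fill_missing_nd is a hypothetical helper name, the sample frame is made up, and all index levels are assumed to be numeric.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import griddata


def fill_missing_nd(df, method="linear"):
    """Fill NaNs per column by interpolating over the MultiIndex coordinates.

    Hypothetical helper: assumes every index level is numeric.
    """
    coords = np.array(df.index.tolist(), dtype=float)  # shape (n_rows, n_levels)
    out = df.copy()
    for col in df.columns:
        known = df[col].notna().to_numpy()
        if known.all() or not known.any():
            continue  # nothing to fill, or nothing to interpolate from
        out.loc[~known, col] = griddata(
            coords[known], df.loc[known, col].to_numpy(),
            coords[~known], method=method,
        )
    return out


# Made-up sample: a 3x3 grid of (x, y) with the centre row missing.
idx = pd.MultiIndex.from_product([[1, 2, 3], [1, 2, 3]], names=["x", "y"])
x = idx.get_level_values("x").to_numpy(dtype=float)
y = idx.get_level_values("y").to_numpy(dtype=float)
df = pd.DataFrame({"xplusy": x + y, "xtimesy": x * y}, index=idx)
df.loc[(2, 2), :] = np.nan

filled = fill_missing_nd(df)
print(filled.loc[(2, 2)])
```

Linear interpolation reproduces xplusy exactly because it is a linear function of x and y; xtimesy is only approximated, and the approximation depends on how the points are triangulated.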
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2021-01-16 21:06
              
            
            
                                                                       
There are different approaches depending on how many rows you have.

I once worked with a 70-million-row dataset on my Mac Pro (16 GB RAM). I had to group rows by product_id, client_id, and week number to calculate customer demand. As in your example, the dataset did not have every product in every week, so I tried these approaches:


1. Find the missing week numbers for each product, fill them in, and reindex. This took too much time and memory to return a result, even when I split the dataset into several pieces.
2. Find the missing week numbers for each product, build a new DataFrame from them, and concat it with the original DataFrame. This was more efficient, but it still took too long (several hours) and used too much memory.
3. Finally, I found a post on Stack Overflow. I unstacked the week number, filled the empty weeks with -9999 (a value that does not occur in the data), and stacked it again. Then I replaced -9999 with np.nan and got what I wanted. It took only a few minutes, so I think it's the right way to do it.
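
The unstack/fill/stack approach in the last item could look roughly like the sketch below. The demand table, its column names, and the SENTINEL constant are made up for illustration; only the -9999 placeholder value comes from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical demand table: product "a" has no row for week 2.
demand = pd.DataFrame(
    {
        "product_id": ["a", "a", "b", "b", "b"],
        "week": [1, 3, 1, 2, 3],
        "units": [5.0, 7.0, 2.0, 4.0, 6.0],
    }
).set_index(["product_id", "week"])

SENTINEL = -9999  # assumed never to occur as a real value

# Move the week level into the columns; absent (product, week) pairs get the sentinel.
wide = demand["units"].unstack("week", fill_value=SENTINEL)

# Stack back so every (product, week) pair has a row, then turn the sentinel into NaN.
full = (
    wide.stack()
    .replace(SENTINEL, np.nan)
    .rename("units")
    .to_frame()
    .sort_index()
)
print(full)
```

Using a sentinel instead of leaving NaN in the wide table means the rows survive stack() even in pandas versions where stack drops missing values by default.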


In conclusion, if you have limited resources, reindex is only practical on a small dataset (I used the first approach on a 5-million-row piece and it returned in minutes), whereas unstack/stack also works on bigger DataFrames.
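
For comparison, the reindex approach on a small frame might be sketched as follows. The demand table is made up, and building the full index with MultiIndex.from_product is one possible way to do it, not necessarily what was used on the original dataset.

```python
import pandas as pd

# Hypothetical demand table: product "a" has no row for week 2.
demand = pd.DataFrame(
    {
        "product_id": ["a", "a", "b", "b", "b"],
        "week": [1, 3, 1, 2, 3],
        "units": [5.0, 7.0, 2.0, 4.0, 6.0],
    }
).set_index(["product_id", "week"])

# Every combination of the existing products and weeks.
full_index = pd.MultiIndex.from_product(
    demand.index.levels, names=demand.index.names
)

# Rows missing from the original appear with NaN in every column.
full = demand.reindex(full_index)
print(full)
```

The full index has (number of products) x (number of weeks) entries, which may help explain why this gets expensive on very large datasets.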
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不思量自难忘°        
                
              
                            
                2021-01-16 21:22
              
            
            
                                                                       
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

def multireindex(_df, new_multi_index, method='linear', copy=True):
    # Interpolate every column of _df at the new index points and append the rows.
    # Note: copy is accepted but not used.
    # Treat the existing index tuples as N-D coordinates.
    _points = np.array(_df.index.tolist())
    new_points = np.array(new_multi_index)
    dfn = dict()
    for aclm in _df.columns:
        # Scattered-data interpolation of one column at the new points.
        dfn[aclm] = griddata(_points, _df[aclm].to_numpy(),
                             new_points, method=method)
    dfn = pd.DataFrame(dfn, index=pd.MultiIndex.from_tuples(
        new_multi_index, names=_df.index.names))
    # The new rows come first, followed by the original rows.
    return pd.concat([dfn, _df])

# Sample frame indexed by (x, y), with random coordinates rounded to 0..100.
arr1 = np.round(np.random.rand(100) * 100)
arr2 = np.round(np.random.rand(100) * 100)
df1 = pd.DataFrame({'x': arr1, 'y': arr2,
                    'plus': arr1 + arr2, 'times': arr1 * arr2})
df1.set_index(['x', 'y'], inplace=True)

# Interpolate at two new (x, y) points.
new_points = [(20.0, 20.0), (25.0, 25.0)]
df2 = multireindex(df1, new_points)
df2.head()
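
One caveat with this approach: griddata with method='linear' returns NaN for query points outside the convex hull of the known points. The sketch below uses made-up points to show that behaviour and one possible fallback to method='nearest'. The fallback is an assumption about what you might want, not part of the function above.

```python
import numpy as np
from scipy.interpolate import griddata

# Four known points on the unit square, with value = x + y.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
values = points.sum(axis=1)

# One query inside the convex hull, one outside it.
queries = np.array([[0.5, 0.5], [2.0, 2.0]])

linear = griddata(points, values, queries, method="linear")
nearest = griddata(points, values, queries, method="nearest")

# Use the nearest-neighbour value only where the linear result is NaN.
combined = np.where(np.isnan(linear), nearest, linear)
print(linear, combined)
```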

                                                                        
                                                        
            
            
              
                