How to create a group ID based on 5-minute intervals in a pandas time series?

I have a time-series DataFrame df that looks like this (the timestamps all fall within the same day, but across different hours):


                      
2 answers

忘了有多久
2020-12-08 16:19

If I understand the question correctly, and depending on what you're doing, this can be done much more easily with the resample method.

# Get some data
import numpy as np
import pandas as pd

# pd.date_range replaces the removed DatetimeIndex(start=..., end=..., freq=...) constructor
index = pd.date_range(start='2013-01-01 00:00', end='2013-01-31 00:00', freq='min')
a = np.random.randint(20, high=30, size=(len(index), 1))
b = np.random.randint(14440, high=14449, size=(len(index), 1))
df = pd.DataFrame(np.concatenate((a, b), axis=1), index=index, columns=['id', 'val'])
df.head()


Out[34]:
                     id  val
2013-01-01 00:00:00  20  14446
2013-01-01 00:01:00  25  14443
2013-01-01 00:02:00  25  14448
2013-01-01 00:03:00  20  14445
2013-01-01 00:04:00  28  14442

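# Define a function for the sample variance
import numpy as np

def pyfun(X):
    # The sample variance is undefined for fewer than two observations
    if X.shape[0] <= 1:
        result = np.nan
    else:
        # First pass: mean of the window
        total = 0
        for x in X:
            total = total + x
        mean = float(total) / X.shape[0]

        # Second pass: sum of squared deviations, divided by n - 1
        total = 0
        for x in X:
            total = total + (mean - x) ** 2
        result = float(total) / (X.shape[0] - 1)

    return result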
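
As a sanity check on the formula above, here is a minimal sketch that recomputes the sample variance for the five id values printed by df.head() and compares it with the pandas and NumPy results. The helper name sample_variance is made up for this illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd


def sample_variance(values):
    """Two-pass sample variance (divides by n - 1); NaN for fewer than 2 points."""
    n = len(values)
    if n <= 1:
        return np.nan
    mean = sum(values) / n
    return sum((mean - v) ** 2 for v in values) / (n - 1)


# The five id values shown by df.head() in the answer
window = pd.Series([20, 25, 25, 20, 28])

ours = sample_variance(window)
pandas_ref = window.var()                      # pandas defaults to ddof=1
numpy_ref = np.var(window.to_numpy(), ddof=1)  # NumPy needs ddof=1 spelled out
print(ours, pandas_ref, numpy_ref)
```

All three values come out at about 12.3, which agrees with the id entry of the first 5-minute window in the resample output of this answer.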

# Try it out (the how= keyword was removed from resample; apply takes the function instead)
df.resample('5min').apply(pyfun).head()


Out[53]:
                     id      val
2013-01-01 00:00:00  12.3    5.7
2013-01-01 00:05:00  9.3     7.3
2013-01-01 00:10:00  4.7     0.8
2013-01-01 00:15:00  10.8    10.3
2013-01-01 00:20:00  11.5    1.5


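Well, that was easy. That approach is for your own functions. If you want a function that a library already provides, such as the variance, you only need to name it: older pandas versions took it through the how keyword of resample, while current versions expose the common reductions as methods on the resampler.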

# Built-in sample variance (ddof=1); old pandas spelled this resample('5min', how=np.var)
df.resample('5min').var().head()


Out[54]:
                     id      val
2013-01-01 00:00:00  12.3    5.7
2013-01-01 00:05:00  9.3     7.3
2013-01-01 00:10:00  4.7     0.8
2013-01-01 00:15:00  10.8    10.3
2013-01-01 00:20:00  11.5    1.5
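
For readers on a current pandas release, the two calls above can be sketched as follows. This is an illustration under assumptions: it uses a seeded generator and only one hour of data so that it is reproducible, and the helper name sample_variance is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
index = pd.date_range("2013-01-01 00:00", periods=60, freq="min")
df = pd.DataFrame(
    {
        "id": rng.integers(20, 30, size=len(index)),
        "val": rng.integers(14440, 14449, size=len(index)),
    },
    index=index,
)


def sample_variance(x):
    """Custom aggregation: sample variance of one column within one window."""
    return x.var(ddof=1) if len(x) > 1 else np.nan


# A custom function goes through apply (or agg) on the resampler ...
custom = df.resample("5min").apply(sample_variance)
# ... while common reductions are available directly as methods.
builtin = df.resample("5min").var()
print(custom.head())
```

Both results hold one row per 5-minute window, labelled by the start of the window.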

                                                                        
                                                        
            
            
              
                
误落风尘
2020-12-08 16:40

You can use pd.Grouper with a freq argument (named pd.TimeGrouper in the old pandas version this answer was written for) in a groupby/apply. With a time-based grouper you don't need to create your period column. I know you're not trying to compute the mean, but I will use it as an example:

>>> df.groupby(pd.Grouper(freq='5min'))['val'].mean()

time
2014-04-03 16:00:00    14390.000000
2014-04-03 16:05:00    14394.333333
2014-04-03 16:10:00    14396.500000


Or an example with an explicit apply:

>>> df.groupby(pd.Grouper(freq='5min'))['val'].apply(lambda x: len(x) > 3)

time
2014-04-03 16:00:00    False
2014-04-03 16:05:00    False
2014-04-03 16:10:00     True
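
The two groupby calls above can be reproduced with a self-contained sketch. The frame is rebuilt from the df printed later in this answer, and pd.Grouper(freq=...) is used on the assumption that a current pandas is installed, where pd.TimeGrouper no longer exists.

```python
import pandas as pd

# Sample frame rebuilt from the df printed in this answer
times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)

grouper = pd.Grouper(freq="5min")  # groups rows of the DatetimeIndex into 5-minute bins

means = df.groupby(grouper)["val"].mean()
more_than_three = df.groupby(grouper)["val"].apply(lambda x: len(x) > 3)
print(means)
print(more_than_three)
```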


Docstring for TimeGrouper:

TimeGrouper(self, freq = 'Min', closed = None, label = None,
how = 'mean', nperiods = None, axis = 0, fill_method = None,
limit = None, loffset = None, kind = None, convention = None, base = 0,
**kwargs)

Custom groupby class for time-interval grouping

Parameters
----------
freq : pandas date offset or offset alias for identifying bin edges
closed : closed end of interval; left or right
label : interval boundary to use for labeling; left or right
nperiods : optional, integer
convention : {'start', 'end', 'e', 's'}
    If axis is PeriodIndex

Notes
-----
Use begin, end, nperiods to generate intervals that cannot be derived
directly from the associated object
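
To illustrate the closed and label parameters described in the docstring, here is a small sketch that counts rows per bin under two settings. It assumes a current pandas, where pd.Grouper accepts the same closed and label arguments, and it reuses the val column of the df printed in this answer.

```python
import pandas as pd

times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {"val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397]},
    index=times,
)

# Default for '5min': bins are closed on the left and labelled by their left edge
left = df.groupby(pd.Grouper(freq="5min"))["val"].count()

# Bins closed on the right and labelled by their right edge
right = df.groupby(
    pd.Grouper(freq="5min", closed="right", label="right")
)["val"].count()

print(left)
print(right)
```

With this data the rows fall into the same three groups either way; only the labels move from the start of each interval to its end.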


Edit

I don't know of an elegant way to create the period column, but the following will work:

>>> new = df.groupby(pd.Grouper(freq='5min'), as_index=False).apply(lambda x: x['val'])
>>> df['period'] = new.index.get_level_values(0)
>>> df

                     id    val  period
time
2014-04-03 16:01:53  23  14389       0
2014-04-03 16:01:54  28  14391       0 
2014-04-03 16:05:55  24  14393       1
2014-04-03 16:06:25  23  14395       1
2014-04-03 16:07:01  23  14395       1
2014-04-03 16:10:09  23  14395       2
2014-04-03 16:10:23  26  14397       2
2014-04-03 16:10:57  26  14397       2
2014-04-03 16:11:10  26  14397       2


It works because the groupby with as_index=False returns the period number you want as the first level of the MultiIndex; I just grab that level and assign it to a new column in the original dataframe. You could do anything in the apply; I only want the index:

>>> new

   time
0  2014-04-03 16:01:53    14389
   2014-04-03 16:01:54    14391
1  2014-04-03 16:05:55    14393
   2014-04-03 16:06:25    14395
   2014-04-03 16:07:01    14395
2  2014-04-03 16:10:09    14395
   2014-04-03 16:10:23    14397
   2014-04-03 16:10:57    14397
   2014-04-03 16:11:10    14397

>>>  new.index.get_level_values(0)

Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
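
A shorter route to the same period column is sketched below. It is a swapped-in technique, not the method used in this answer: each timestamp is floored to its 5-minute bin with DatetimeIndex.floor, and the bins are numbered in order of appearance with pd.factorize.

```python
import pandas as pd

times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)

# Start of the 5-minute bin each row falls into
bins = df.index.floor("5min")

# factorize returns (codes, uniques); the codes number the bins by first appearance
df["period"] = pd.factorize(bins)[0]
print(df)
```

Because the codes follow order of appearance, the index should be sorted by time first if the period numbers are meant to increase chronologically.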

                                                                        
                                                        
            
            
              
                