How to create a group ID based on 5-minute intervals in a pandas time series?

北海茫月 2020-12-08 15:37

I have a time series DataFrame df that looks like this (the time series falls within a single day, but spans several hours):

                                   


        
2 Answers
  • 2020-12-08 16:19

    Depending on what you're doing, and if I understand the question correctly, this can be done much more easily with the resample method.

    # Get some data
    import numpy as np
    import pandas as pd

    # pd.date_range replaces the removed DatetimeIndex(start=..., end=...) constructor
    index = pd.date_range(start='2013-01-01 00:00', end='2013-01-31 00:00', freq='min')
    a = np.random.randint(20, high=30, size=(len(index), 1))
    b = np.random.randint(14440, high=14449, size=(len(index), 1))
    df = pd.DataFrame(np.concatenate((a, b), axis=1), index=index, columns=['id', 'val'])
    df.head()
    
    
    Out[34]:
                         id  val
    2013-01-01 00:00:00  20  14446
    2013-01-01 00:01:00  25  14443
    2013-01-01 00:02:00  25  14448
    2013-01-01 00:03:00  20  14445
    2013-01-01 00:04:00  28  14442
    
    # Define a function for the sample variance (divides by n - 1)
    import numpy as np

    def pyfun(X):
        # Variance is undefined for fewer than two observations
        if X.shape[0] <= 1:
            result = np.nan
    
        else:    
            total = 0
            for x in X:
                total = total + x
            mean = float(total) / X.shape[0]
    
            total = 0
            for x in X:
                total = total + (mean-x)**2
            result = float(total) / (X.shape[0]-1)
    
        return result
    
    # Try it out (the how= keyword of resample has been removed; use agg instead)
    df.resample('5min').agg(pyfun).head()
    
    
    Out[53]:
                         id      val
    2013-01-01 00:00:00  12.3    5.7
    2013-01-01 00:05:00  9.3     7.3
    2013-01-01 00:10:00  4.7     0.8
    2013-01-01 00:15:00  10.8    10.3
    2013-01-01 00:20:00  11.5    1.5
    

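    Well, that was easy. That covers your own functions. If you want a ready-made function instead, name it in the aggregation step. Older pandas versions took it through the how keyword (how=np.var); in current pandas that keyword has been removed, so pass the function to agg or call the built-in method directly. The built-in var computes the same sample variance as pyfun:

    df.resample('5min').var().head()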
    
    
    Out[54]:
                         id      val
    2013-01-01 00:00:00  12.3    5.7
    2013-01-01 00:05:00  9.3     7.3
    2013-01-01 00:10:00  4.7     0.8
    2013-01-01 00:15:00  10.8    10.3
    2013-01-01 00:20:00  11.5    1.5
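
    The two outputs above agree because pandas' var divides by n - 1, exactly as pyfun does, whereas np.var called on a plain array divides by n by default. The sketch below checks this on a small series; the ten values and the helper name sample_variance are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1), mirroring pyfun."""
    n = len(values)
    if n <= 1:
        return np.nan
    mean = sum(values) / n
    return sum((mean - v) ** 2 for v in values) / (n - 1)


# Ten one-minute observations, i.e. two full 5-minute bins (illustrative values).
index = pd.date_range("2013-01-01 00:00", periods=10, freq="min")
s = pd.Series([20, 25, 25, 20, 28, 21, 29, 24, 22, 27], index=index, dtype="float64")

by_hand = s.resample("5min").agg(sample_variance)   # hand-written, n - 1
built_in = s.resample("5min").var()                  # pandas default, ddof=1
population = s.resample("5min").agg(lambda x: np.var(x.to_numpy()))  # NumPy default, ddof=0

print(pd.DataFrame({"by_hand": by_hand, "built_in": built_in, "population": population}))
```

    If you need the population variance instead, pass ddof=0 explicitly, for example s.resample("5min").var(ddof=0).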
    
  • 2020-12-08 16:40

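    You can use a time-based grouper in a groupby/apply: pd.TimeGrouper in the pandas version this answer was written for, pd.Grouper(freq=...) in current releases, where TimeGrouper has been removed. With it you don't need to create your period column. I know you're not trying to compute the mean, but I will use it as an example:

    >>> df.groupby(pd.Grouper(freq='5min'))['val'].mean()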
    
    time
    2014-04-03 16:00:00    14390.000000
    2014-04-03 16:05:00    14394.333333
    2014-04-03 16:10:00    14396.500000
    

    Or an example with an explicit apply:

    >>> df.groupby(pd.Grouper(freq='5min'))['val'].apply(lambda x: len(x) > 3)
    
    time
    2014-04-03 16:00:00    False
    2014-04-03 16:05:00    False
    2014-04-03 16:10:00     True
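
    For anyone who wants to reproduce the two results above, here is a minimal self-contained sketch. It assumes the frame is the nine-row sample printed in the Edit section of this answer (the question's own data did not survive on this page) and uses pd.Grouper(freq=...), the current replacement for TimeGrouper.

```python
import pandas as pd

# Rows copied from the sample frame printed in the Edit section of this answer.
times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)

# Mean of val per 5-minute bin; bins are labelled by their left edge.
means = df.groupby(pd.Grouper(freq="5min"))["val"].mean()

# Explicit apply: does the bin hold more than three rows?
more_than_three = df.groupby(pd.Grouper(freq="5min"))["val"].apply(lambda x: len(x) > 3)

print(means)
print(more_than_three)
```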
    

    Docstring for TimeGrouper (as shipped in the older pandas release used here):

    Docstring for resample:class TimeGrouper@21
    
    TimeGrouper(self, freq = 'Min', closed = None, label = None,
    how = 'mean', nperiods = None, axis = 0, fill_method = None,
    limit = None, loffset = None, kind = None, convention = None, base = 0,
    **kwargs)
    
    Custom groupby class for time-interval grouping
    
    Parameters
    ----------
    freq : pandas date offset or offset alias for identifying bin edges
    closed : closed end of interval; left or right
    label : interval boundary to use for labeling; left or right
    nperiods : optional, integer
    convention : {'start', 'end', 'e', 's'}
        If axis is PeriodIndex
    
    Notes
    -----
    Use begin, end, nperiods to generate intervals that cannot be derived
    directly from the associated object
    

    Edit

    I don't know of an elegant way to create the period column, but the following will work:

    >>> new = df.groupby(pd.Grouper(freq='5min'), as_index=False).apply(lambda x: x['val'])
    >>> df['period'] = new.index.get_level_values(0)
    >>> df
    
                         id    val  period
    time
    2014-04-03 16:01:53  23  14389       0
    2014-04-03 16:01:54  28  14391       0 
    2014-04-03 16:05:55  24  14393       1
    2014-04-03 16:06:25  23  14395       1
    2014-04-03 16:07:01  23  14395       1
    2014-04-03 16:10:09  23  14395       2
    2014-04-03 16:10:23  26  14397       2
    2014-04-03 16:10:57  26  14397       2
    2014-04-03 16:11:10  26  14397       2
    

    This works because the groupby with as_index=False returns the period number you want as the first level of the result's MultiIndex. I just grab that level and assign it to a new column in the original DataFrame. You could do anything in the apply; I only want the index:

    >>> new
    
       time
    0  2014-04-03 16:01:53    14389
       2014-04-03 16:01:54    14391
    1  2014-04-03 16:05:55    14393
       2014-04-03 16:06:25    14395
       2014-04-03 16:07:01    14395
    2  2014-04-03 16:10:09    14395
       2014-04-03 16:10:23    14397
       2014-04-03 16:10:57    14397
       2014-04-03 16:11:10    14397
    
    >>> new.index.get_level_values(0)
    
    Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
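
    As a possible alternative to the groupby/apply route, the period column can also be built straight from the index. This is a sketch of a different technique, not the method used above: it floors each timestamp to its 5-minute bin and numbers the bins with pd.factorize. The frame is rebuilt from the rows printed earlier in this answer, and the names bin_start, codes and uniques are illustrative.

```python
import pandas as pd

# Rows copied from the sample frame printed earlier in this answer.
times = pd.DatetimeIndex(
    [
        "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
        "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
        "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=times,
)

# Left edge of the 5-minute bin each row falls into (16:00, 16:05, 16:10, ...).
bin_start = df.index.floor("5min")

# factorize assigns 0, 1, 2, ... to distinct bins in order of first appearance.
codes, uniques = pd.factorize(bin_start)
df["period"] = codes

print(df)
```

    Note that this numbers only the bins that actually contain rows, so a 5-minute gap with no data does not consume an ID.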
    