Pandas: resample timeseries with groupby

时光说笑 2020-11-27 12:29

Given the below pandas DataFrame:

In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00','2014-08-25 21:04:00',

4 Answers
  • 2020-11-27 12:34

    In my original post, I suggested using pd.TimeGrouper. It is now deprecated, so use pd.Grouper instead. The syntax is largely the same.

    Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can group by datetime columns which you can specify through the key parameter.
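
    As a minimal sketch of the key parameter mentioned above: the frame below is hypothetical and stores its timestamps in an ordinary column (named "when" here purely for illustration) rather than in the index. The lowercase "h" alias is used because recent pandas releases deprecate the uppercase "H" spelling.

```python
import pandas as pd

# Hypothetical frame: timestamps live in a regular column, not the index.
df = pd.DataFrame({
    "when": pd.to_datetime([
        "2014-08-25 21:00:00", "2014-08-25 21:04:00",
        "2014-08-25 22:07:00", "2014-08-25 22:09:00",
    ]),
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
})

# key= names the datetime column to bin; freq="h" bins it by hour.
counts = df.groupby([pd.Grouper(key="when", freq="h"), "Location"])["Event"].count()
print(counts)
```

    With key= there is no need to call set_index first; the rest of the pipeline (unstack, fillna) should work the same way as with a DatetimeIndex.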


    You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:

    grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
    

    Use count to count the number of events in each group:

    grouper['Event'].count()
    #                      Location
    # 2014-08-25 21:00:00  HK          1
    #                      LDN         1
    # 2014-08-25 22:00:00  LDN         2
    # Name: Event, dtype: int64
    

    Use unstack to move the Location index level to a column level:

    grouper['Event'].count().unstack()
    # Out[49]: 
    # Location             HK  LDN
    # 2014-08-25 21:00:00   1    1
    # 2014-08-25 22:00:00 NaN    2
    

    Then use fillna to change the NaNs into zeros.


    Putting it all together,

    grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
    result = grouper['Event'].count().unstack('Location').fillna(0)
    

    yields

    Location             HK  LDN
    2014-08-25 21:00:00   1    1
    2014-08-25 22:00:00   0    2
    
  • 2020-11-27 12:34

    Multiple Column Group By

    unutbu's answer is spot on, but I wanted to add what you can do if you have a third column, say Cost, and want to aggregate it as above. I worked this out by combining unutbu's answer with this one, and thought I would share it for future users.

    Create a DataFrame with Cost column:

    In[1]:
    import pandas as pd
    import numpy as np
    times = pd.to_datetime([
        "2014-08-25 21:00:00", "2014-08-25 21:04:00",
        "2014-08-25 22:07:00", "2014-08-25 22:09:00"
    ])
    df = pd.DataFrame({
        "Location": ["HK", "LDN", "LDN", "LDN"],
        "Event":    ["foo", "bar", "baz", "qux"],
        "Cost":     [20, 24, 34, 52]
    }, index = times)
    df
    
    Out[1]:
                         Location  Event  Cost
    2014-08-25 21:00:00        HK    foo    20
    2014-08-25 21:04:00       LDN    bar    24
    2014-08-25 22:07:00       LDN    baz    34
    2014-08-25 22:09:00       LDN    qux    52
    

    Now group the rows and use the agg function to specify each column's aggregation method, e.g. count, mean, sum, etc.

    In[2]:
    # String names select the same aggregations as np.size / np.mean
    grp = df.groupby([pd.Grouper(freq = "1H"), "Location"]) \
          .agg({"Event": "size", "Cost": "mean"})
    grp
    
    Out[2]:
                                   Event  Cost
                         Location
    2014-08-25 21:00:00  HK            1    20
                         LDN           1    24
    2014-08-25 22:00:00  LDN           2    43
    

    Then do the final unstack, fill the NaNs with zeros, and cast to int for a cleaner display.

    In[3]: 
    grp.unstack().fillna(0).astype(int)
    
    Out[3]:
                        Event     Cost
    Location               HK LDN   HK LDN
    2014-08-25 21:00:00     1   1   20  24
    2014-08-25 22:00:00     0   2    0  43
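
    A possible variation on the dict-based agg call is named aggregation, which lets you choose the output column names. This is only a sketch: it assumes pandas 0.25 or later, and the names n_events and mean_cost are made up for illustration. The lowercase "h" alias is used because recent pandas releases deprecate uppercase "H".

```python
import pandas as pd

times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00",
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
    "Cost": [20, 24, 34, 52],
}, index=times)

# Named aggregation: output_name=(source_column, aggregation).
# n_events and mean_cost are illustrative names, not part of the original answer.
summary = df.groupby([pd.Grouper(freq="h"), "Location"]).agg(
    n_events=("Event", "count"),
    mean_cost=("Cost", "mean"),
)
print(summary)
```

    The result has flat column names, so a later unstack produces labels such as (n_events, HK) rather than (Event, HK).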
    
  • 2020-11-27 12:45

    Pandas 0.21 answer: TimeGrouper is getting deprecated

    There are two options for doing this. They can give different results depending on your data. The first option groups by Location and then, within each Location, by hour. The second option groups by Location and hour at the same time.

    Option 1: Use groupby + resample

    grouped = df.groupby('Location').resample('H')['Event'].count()
    

    Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)

    grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
    

    On this data, both produce the following:

    Location                     
    HK        2014-08-25 21:00:00    1
    LDN       2014-08-25 21:00:00    1
              2014-08-25 22:00:00    2
    Name: Event, dtype: int64
    

    And then reshape:

    grouped.unstack('Location', fill_value=0)
    

    This outputs:

    Location             HK  LDN
    2014-08-25 21:00:00   1    1
    2014-08-25 22:00:00   0    2
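
    To illustrate how the two options can diverge, here is a small sketch on hypothetical data (not the data from the question) in which one location has an hour with no events between two hours that do have events. The lowercase "h" alias is used because recent pandas releases deprecate uppercase "H".

```python
import pandas as pd

# Hypothetical data: LDN has events at 21:xx and 23:xx, nothing at 22:xx.
times = pd.to_datetime(["2014-08-25 21:00:00", "2014-08-25 23:30:00"])
df = pd.DataFrame({"Location": ["LDN", "LDN"], "Event": ["foo", "bar"]}, index=times)

# Option 1: resample within each Location; empty hours inside the range appear with count 0.
opt1 = df.groupby("Location").resample("h")["Event"].count()

# Option 2: group on (Location, hour) pairs; only pairs that actually occur are returned.
opt2 = df.groupby(["Location", pd.Grouper(freq="h")])["Event"].count()

print(opt1)
print(opt2)
```

    Option 1 includes a row for the empty 22:00 hour, while Option 2 omits it, so pick the one that matches how you want gaps to be reported.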
    
  • 2020-11-27 12:48

    This can be done without using resample or Grouper as follows:

    df.groupby([df.index.floor("1H"), "Location"]).count()
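
    The floor-based grouping above counts every remaining column. As a sketch, assuming the same sample frame that the Cost answer builds, it can be restricted to Event and reshaped into the Location table. The lowercase "h" alias is used because recent pandas releases deprecate uppercase "H".

```python
import pandas as pd

# Sample frame assumed for illustration (same values as the Cost example, minus Cost).
times = pd.to_datetime([
    "2014-08-25 21:00:00", "2014-08-25 21:04:00",
    "2014-08-25 22:07:00", "2014-08-25 22:09:00",
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
}, index=times)

# Round each timestamp down to the hour, then group on (hour, Location).
counts = df.groupby([df.index.floor("h"), "Location"])["Event"].count()

# Move Location into the columns; fill_value=0 fills missing pairs and keeps integers.
table = counts.unstack("Location", fill_value=0)
print(table)
```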
