Groupby and resample timeseries so date ranges are consistent

Posted by 萝らか妹 on 2021-02-09 10:55:23

Question


I have a dataframe which is basically several time series stacked on top of one another. Each time series has a unique label (group), and the series cover different date ranges.

import pandas as pd

date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
                                  '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})
df
        date   group   value
0 2010-01-01       1       1
1 2010-01-02       1       2
2 2010-01-03       1       3
3 2010-01-06       1       4
4 2010-01-01       2       5
5 2010-01-03       2       6

I would like to resample the data so that there is an entry for every single combination of date and group, with the value set to NaN if there was no observation that day or the date falls outside that group's date range. Example output would be:

      date   group   value                 
2010-01-01       1       1
2010-01-02       1       2
2010-01-03       1       3
2010-01-04       1       NaN
2010-01-05       1       NaN
2010-01-06       1       4
2010-01-01       2       5
2010-01-02       2       NaN
2010-01-03       2       6
2010-01-04       2       NaN
2010-01-05       2       NaN
2010-01-06       2       NaN

I have a solution that works, but I suspect there are better approaches. My solution is to pivot the data, then unstack, group by and resample. All that is really needed is a groupby and resample in which the resampling range for every group is set by the min and max of the whole date column, but I can't see any way to do that.

# Column names match the dataframe defined above: 'date', 'group', 'value'.
df = (df.pivot(index='date', columns='group', values='value')
        .unstack()
        .reset_index()
        .set_index('date')
        .groupby('group').resample('D').asfreq()
        .drop('group', axis=1)
        .reset_index()
        .rename(columns={0:'value'}))[['date','group', 'value']]

Answer 1:


Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.


Set the index to ['date', 'group'], then use pandas.MultiIndex.from_product to build the Cartesian product of every date in the full range with every group, and reindex against it. Here I also use fill_value=0 to fill in the missing values.

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or, leaving out fill_value to get NaN:

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
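
The NaN variant above turns the integer values into floats (1.0, 5.0, ...), because a plain integer column cannot hold NaN. If the integers matter, one possible workaround is pandas' nullable 'Int64' dtype. The following is only a sketch under that assumption, rebuilding the question's dataframe so it runs on its own; the name `filled` is made up for illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# Assumption: the nullable 'Int64' extension dtype is acceptable downstream.
d = df.astype({'value': 'Int64'}).set_index(['date', 'group'])

# Every date in the overall range crossed with every group.
midx = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max()), df['group'].unique()],
    names=d.index.names,
)

# Missing combinations become <NA> and the column stays integer-typed.
filled = d.reindex(midx).reset_index()
print(filled)
```

The row order is the same as in the outputs above (by date, then by group); only the dtype of the value column differs.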

Another dance we could do is a cleaned-up version of the OP's attempt. Again I use fill_value=0 to fill in missing values; leaving it out produces NaN instead.

df.set_index(['date', 'group']) \
  .unstack(fill_value=0) \
  .asfreq('D', fill_value=0) \
  .stack().reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

Or, without fill_value, keeping the NaN rows when stacking:

df.set_index(['date', 'group']) \
  .unstack() \
  .asfreq('D') \
  .stack(dropna=False).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
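
The behaviour of stack and its dropna argument has been changing in newer pandas releases, so the snippet above may warn or fail depending on the installed version. A sketch of the same wide-then-long idea that avoids stack altogether uses pivot, reindex and melt. The names `wide`, `full_range` and `long` are made up for illustration, and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# One column per group, one row per observed date.
wide = df.pivot(index='date', columns='group', values='value')

# Extend the rows to every day between the overall min and max date.
full_range = pd.date_range(df['date'].min(), df['date'].max())

# melt keeps NaN cells, so no dropna handling is needed.
long = (wide.reindex(full_range)
            .rename_axis(index='date')
            .reset_index()
            .melt(id_vars='date', var_name='group', value_name='value'))
print(long)
```

Because melt walks through the columns one at a time, the rows come out grouped by group and then by date, which is the order of the example output in the question.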



Answer 2:


Another way: build every (date, group) pair with itertools.product, then left-merge the original data onto it:

import pandas as pd
from itertools import product

date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03', 
                                  '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})


dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
df = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
            .merge(df, on=['date', 'group'], how='left')
            .sort_values(['group', 'date'])
            .reset_index(drop=True))

df
#         date  group  value
#0  2010-01-01      1    1.0
#1  2010-01-02      1    2.0
#2  2010-01-03      1    3.0
#3  2010-01-04      1    NaN
#4  2010-01-05      1    NaN
#5  2010-01-06      1    4.0
#6  2010-01-01      2    5.0
#7  2010-01-02      2    NaN
#8  2010-01-03      2    6.0
#9  2010-01-04      2    NaN
#10 2010-01-05      2    NaN
#11 2010-01-06      2    NaN
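
The question asks for a groupby in which every group is expanded to the min and max of the whole date column. One way that might be written is to reindex each group against a shared date range instead of resampling. This is only a sketch, not taken from either answer; `full_range` and `out` are made-up names and the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# Daily range spanning the whole date column, shared by all groups.
full_range = pd.date_range(df['date'].min(), df['date'].max())

out = (df.set_index('date')
         .groupby('group')['value']
         # Reindex each group's series against the shared range;
         # dates the group never observed become NaN.
         .apply(lambda s: s.reindex(full_range))
         # Name both index levels explicitly before flattening.
         .rename_axis(['group', 'date'])
         .rename('value')
         .reset_index()[['date', 'group', 'value']])
print(out)
```

This assumes each (date, group) pair occurs at most once, as in the sample data; reindex raises an error on duplicate dates within a group.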


Source: https://stackoverflow.com/questions/50273308/groupby-and-resample-timeseries-so-date-ranges-are-consistent
