Groupby and resample timeseries so date ranges are consistent

Question

I have a dataframe which is basically several timeseries stacked on top of one another. Each time series has a unique label (group) and they have different date ranges.

import pandas as pd

date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03', 
                                  '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})
df
        date   group   value
0 2010-01-01       1       1
1 2010-01-02       1       2
2 2010-01-03       1       3
3 2010-01-06       1       4
4 2010-01-01       2       5
5 2010-01-03       2       6

I would like to resample the data so that there is an entry for every combination of date and group, with the value set to NaN if there was no observation that day or the day falls outside that group's date range. Example output would be:

      date   group   value
2010-01-01       1       1
2010-01-02       1       2
2010-01-03       1       3
2010-01-04       1       NaN
2010-01-05       1       NaN
2010-01-06       1       4
2010-01-01       2       5
2010-01-02       2       NaN
2010-01-03       2       6
2010-01-04       2       NaN
2010-01-05       2       NaN
2010-01-06       2       NaN

I have a solution that works, but I suspect there are better approaches. My solution is to first pivot the data, then unstack, group by and resample. Really, all that is needed is a groupby and resample in which the resampling range runs from the minimum to the maximum of the whole date column, but I can't see any way to do that.

# Column names match the dataframe above: 'date', 'group', 'value'
df = (df.pivot(index='date', columns='group', values='value')
        .unstack()
        .reset_index()
        .set_index('date')
        .groupby('group').resample('D').asfreq()
        .drop('group', axis=1)
        .reset_index()
        .rename(columns={0:'value'}))[['date','group', 'value']]

Answer 1:

Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.

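Set the index, then use pandas.MultiIndex.from_product to produce the Cartesian product of all dates and groups, and reindex against it. The first version uses fill_value=0 to fill in the missing values; the second leaves fill_value out so they stay NaN, as the question asks.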

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
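
The outputs above are ordered by date and then by group, whereas the question lists each group's dates together. A minimal sketch of one way to get that ordering, assuming the same df as in the question, is to put group first in the product so the dates vary fastest. The names full_range and by_group are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# One shared daily range covering the whole date column.
full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

# 'group' comes first, so the product walks through every date of group 1,
# then every date of group 2, and so on.
midx = pd.MultiIndex.from_product(
    [df['group'].unique(), full_range],
    names=['group', 'date'],
)

# Reindex against the full grid; missing combinations become NaN.
by_group = (df.set_index(['group', 'date'])
              .reindex(midx)
              .reset_index()[['date', 'group', 'value']])
```

Passing fill_value=0 to reindex here would give zeros instead of NaN, as in the first version above.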

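Another option is a cleaned-up version of the OP's attempt: unstack the groups into columns, use asfreq('D') to add the missing days, then stack back. Again I use fill_value=0 to fill in the missing values. Leave that out, and pass dropna=False to stack, to produce NaN instead.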

df.set_index(['date', 'group']) \
  .unstack(fill_value=0) \
  .asfreq('D', fill_value=0) \
  .stack().reset_index()

         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0

df.set_index(['date', 'group']) \
  .unstack() \
  .asfreq('D') \
  .stack(dropna=False).reset_index()

         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
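
The question asks for a groupby in which every group uses the overall minimum and maximum date. As a rough sketch of that idea (not part of the original answer), each group can be reindexed against one shared date range and the pieces concatenated. The names full_range, pieces and out are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# The same daily range is used for every group.
full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

# Reindex each group's values against the shared range; absent days become NaN.
pieces = {
    key: grp.set_index('date')['value'].reindex(full_range)
    for key, grp in df.groupby('group')
}

# Stack the per-group Series back together, labelling the two index levels.
out = pd.concat(pieces, names=['group', 'date']).reset_index(name='value')
```

This loops over the groups in Python, so for many groups the single reindex against a MultiIndex is likely to be faster.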

Answer 2:

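Another way: build every (date, group) pair with itertools.product, then left-merge the original data onto that grid so that missing pairs come out as NaN: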

import pandas as pd
from itertools import product

date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03', 
                                  '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})


dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
df = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
            .merge(df, on=['date', 'group'], how='left')
            .sort_values(['group', 'date'])
            .reset_index(drop=True))

df
#         date  group  value
#0  2010-01-01      1    1.0
#1  2010-01-02      1    2.0
#2  2010-01-03      1    3.0
#3  2010-01-04      1    NaN
#4  2010-01-05      1    NaN
#5  2010-01-06      1    4.0
#6  2010-01-01      2    5.0
#7  2010-01-02      2    NaN
#8  2010-01-03      2    6.0
#9  2010-01-04      2    NaN
#10 2010-01-05      2    NaN
#11 2010-01-06      2    NaN
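
The same grid can probably be built without itertools. A minimal sketch, assuming pandas 1.2 or later (where merge accepts how='cross'), with the names dates_df, groups_df, grid and filled chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# One-column frames holding every date in the overall range and every group.
dates_df = pd.DataFrame(
    {'date': pd.date_range(df['date'].min(), df['date'].max(), freq='D')})
groups_df = pd.DataFrame({'group': df['group'].unique()})

# Cross join: every group paired with every date.
grid = groups_df.merge(dates_df, how='cross')

# Left-merge the observations onto the grid; unmatched pairs get NaN.
filled = (grid.merge(df, on=['date', 'group'], how='left')
              .sort_values(['group', 'date'])
              .reset_index(drop=True)[['date', 'group', 'value']])
```

The result should match the output shown above; the cross join simply replaces the list(product(...)) step.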

Source: https://stackoverflow.com/questions/50273308/groupby-and-resample-timeseries-so-date-ranges-are-consistent

Tags

python

pandas

dataframe

time-series

pandas-groupby