Question
I have a dataframe which is basically several timeseries stacked on top of one another. Each time series has a unique label (group) and they have different date ranges.
import pandas as pd
date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
                                 '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1, 1, 1, 1, 2, 2]
value = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'date': date, 'group': group, 'value': value})
df
        date  group  value
0 2010-01-01      1      1
1 2010-01-02      1      2
2 2010-01-03      1      3
3 2010-01-06      1      4
4 2010-01-01      2      5
5 2010-01-03      2      6
I would like to resample the data so that there is an entry for every combination of date and group, with the value set to NaN if there was no observation that day or the day is outside that group's date range. Example output would be:
date        group  value
2010-01-01  1      1
2010-01-02  1      2
2010-01-03  1      3
2010-01-04  1      NaN
2010-01-05  1      NaN
2010-01-06  1      4
2010-01-01  2      5
2010-01-02  2      NaN
2010-01-03  2      6
2010-01-04  2      NaN
2010-01-05  2      NaN
2010-01-06  2      NaN
I have a solution that works, but I suspect there are better approaches. My solution is to first pivot the data, then unstack, group by and resample. All that is really needed is a groupby and resample in which the resampling range is set by the min and max of the whole date column, but I can't see any way to do that.
df = (df.pivot(index='date', columns='group', values='value')
        .unstack()
        .reset_index()
        .set_index('date')
        .groupby('group').resample('D').asfreq()
        .drop('group', axis=1)
        .reset_index()
        .rename(columns={0: 'value'}))[['date', 'group', 'value']]
Answer 1:
Credit to zipa for getting the dates correct. I've edited my post to correct my mistake.
Set the index, then use pandas.MultiIndex.from_product to produce the Cartesian product of all dates and groups. I also use fill_value=0 to fill in the missing values.
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()
         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0
Or, leave out fill_value to get NaN as in the question:
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
    names=d.index.names
)
d.reindex(midx).reset_index()
         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
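Both results so far are ordered by date first, while the expected output in the question is ordered by group first. A minimal sketch of one way to get that ordering, assuming it is acceptable to put group ahead of date in the index; the names full_dates and out are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# Every calendar day between the overall min and max date.
full_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

# Group is listed first, so it varies slowest and rows come out group by group.
midx = pd.MultiIndex.from_product(
    [df['group'].unique(), full_dates],
    names=['group', 'date'],
)

out = (df.set_index(['group', 'date'])
         .reindex(midx)                      # missing combinations become NaN
         .reset_index()[['date', 'group', 'value']])
print(out)
```

Passing fill_value=0 to reindex would give zeros instead of NaN, as in the first variant.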
Another dance we could do is a cleaned up version of OP's attempt. Again I use fill_value=0 to fill in missing values. We could leave that out to produce NaN.
df.set_index(['date', 'group']) \
    .unstack(fill_value=0) \
    .asfreq('D', fill_value=0) \
    .stack().reset_index()
         date  group  value
0  2010-01-01      1      1
1  2010-01-01      2      5
2  2010-01-02      1      2
3  2010-01-02      2      0
4  2010-01-03      1      3
5  2010-01-03      2      6
6  2010-01-04      1      0
7  2010-01-04      2      0
8  2010-01-05      1      0
9  2010-01-05      2      0
10 2010-01-06      1      4
11 2010-01-06      2      0
Or, keeping NaN:
df.set_index(['date', 'group']) \
    .unstack() \
    .asfreq('D') \
    .stack(dropna=False).reset_index()
         date  group  value
0  2010-01-01      1    1.0
1  2010-01-01      2    5.0
2  2010-01-02      1    2.0
3  2010-01-02      2    NaN
4  2010-01-03      1    3.0
5  2010-01-03      2    6.0
6  2010-01-04      1    NaN
7  2010-01-04      2    NaN
8  2010-01-05      1    NaN
9  2010-01-05      2    NaN
10 2010-01-06      1    4.0
11 2010-01-06      2    NaN
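The question asks whether a plain groupby can be combined with the min and max of the whole date column. One way to sketch that, offered as an alternative rather than as part of the original answer, is to reindex each group onto one shared daily range. The helper name fill_group is hypothetical, and the sketch assumes each (date, group) pair appears at most once, as in the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

# Shared range built from the min and max of the whole date column.
full_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')


def fill_group(key, sub):
    """Reindex one group's values onto the shared daily range (hypothetical helper)."""
    values = sub.set_index('date')['value'].reindex(full_dates)
    return pd.DataFrame({'date': full_dates,
                         'group': key,
                         'value': values.to_numpy()})


out = pd.concat(
    [fill_group(key, sub) for key, sub in df.groupby('group')],
    ignore_index=True,
)
print(out)
```

The rows come out group by group, which matches the ordering of the expected output in the question.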
Answer 2:
Another way, using itertools.product and a left merge:
import pandas as pd
from itertools import product
date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
                                 '2010-01-06', '2010-01-01', '2010-01-03']))
group = [1, 1, 1, 1, 2, 2]
value = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'date': date, 'group': group, 'value': value})
dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
df = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
        .merge(df, on=['date', 'group'], how='left')
        .sort_values(['group', 'date'])
        .reset_index(drop=True))
df
#          date  group  value
#0   2010-01-01      1    1.0
#1   2010-01-02      1    2.0
#2   2010-01-03      1    3.0
#3   2010-01-04      1    NaN
#4   2010-01-05      1    NaN
#5   2010-01-06      1    4.0
#6   2010-01-01      2    5.0
#7   2010-01-02      2    NaN
#8   2010-01-03      2    6.0
#9   2010-01-04      2    NaN
#10  2010-01-05      2    NaN
#11  2010-01-06      2    NaN
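The merge-based result and the reindex-based result from the first answer should contain the same rows once both are sorted the same way. A quick check, written as a sketch; the names merged and reindexed are illustrative, and dtypes are deliberately not compared because details such as datetime resolution can differ across pandas versions:

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                            '2010-01-06', '2010-01-01', '2010-01-03']),
    'group': [1, 1, 1, 1, 2, 2],
    'value': [1, 2, 3, 4, 5, 6],
})

dates = pd.date_range(df['date'].min(), df['date'].max())
groups = df['group'].unique()

# Cartesian product of dates and groups, then a left merge of the observations.
merged = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
            .merge(df, on=['date', 'group'], how='left')
            .sort_values(['group', 'date'])
            .reset_index(drop=True))

# Reindex onto a MultiIndex built from the same product.
midx = pd.MultiIndex.from_product([dates, groups], names=['date', 'group'])
reindexed = (df.set_index(['date', 'group'])
               .reindex(midx)
               .reset_index()
               .sort_values(['group', 'date'])
               .reset_index(drop=True))

# Raises AssertionError if the two frames differ in values (NaN positions included).
pd.testing.assert_frame_equal(merged, reindexed, check_dtype=False)
```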
Source: https://stackoverflow.com/questions/50273308/groupby-and-resample-timeseries-so-date-ranges-are-consistent