Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.
pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes
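To illustrate the update, here is a minimal sketch of binning a datetime column directly, assuming pandas 0.20.0 or later. The sample dates and the names recd, edges, binned and halves are made up for the example and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample dates, not the data from the question.
recd = pd.Series(pd.to_datetime(
    ["2012-07-24", "2012-08-15", "2012-09-30", "2012-12-23"]))

# Month-start edges covering the data; the last edge lies past the maximum.
edges = pd.date_range("2012-07-01", "2013-01-01", freq="MS")

# right=False makes each bin [start of month, start of next month).
binned = pd.cut(recd, bins=edges, right=False)

# qcut also accepts datetimes: two equal-frequency bins.
halves = pd.qcut(recd, q=2)
```

binned.cat.codes gives the month bin of each row, so months with no rows (October and November here) simply never appear among the codes.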
How about using Series and putting the parts of the DataFrame that you're interested in into that, then calling cut on the series object?
price_series = pd.Series(df.price.tolist(), index=df.recd)
and then
pd.qcut(price_series, q=3)
and so on. (Though I think @Jeff's answer is best)
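A self-contained sketch of the same idea, in case it helps. The DataFrame below is a made-up stand-in for the one in the question (only the column names recd and price come from it), and per_bin is a name introduced just for the example.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
df = pd.DataFrame({
    "recd": pd.to_datetime(["2012-07-24", "2012-08-15", "2012-09-30",
                            "2012-12-23", "2013-02-05", "2013-05-03"]),
    "price": [64.15, 93.48, 22.55, 12.35, 59.94, 80.97],
})

# Prices indexed by the date they were received.
price_series = pd.Series(df["price"].to_numpy(), index=df["recd"])

# Three equal-frequency price bins; the result keeps the date index.
price_bins = pd.qcut(price_series, q=3)

# How many rows landed in each bin.
per_bin = price_bins.value_counts()
```

Because the result is still indexed by recd, it can be resampled or grouped by date afterwards.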
You just need to set the index to the field you'd like to resample by. Here are some examples:
In [36]: df.set_index('recd').resample('1M',how='sum')
Out[36]:
                 price  qty
recd
2012-07-31   64.151194    9
2012-08-31   93.476665    7
2012-09-30   94.193027    7
2012-10-31         NaN  NaN
2012-11-30         NaN  NaN
2012-12-31   12.353405    6
2013-01-31         NaN  NaN
2013-02-28  129.586697    7
2013-03-31         NaN  NaN
2013-04-30         NaN  NaN
2013-05-31  211.979583   13
In [37]: df.set_index('recd').resample('1M',how='count')
Out[37]:
2012-07-31  price    1
            qty      1
            ship     1
2012-08-31  price    1
            qty      1
            ship     1
2012-09-30  price    2
            qty      2
            ship     2
2012-10-31  price    0
            qty      0
            ship     0
2012-11-30  price    0
            qty      0
            ship     0
2012-12-31  price    1
            qty      1
            ship     1
2013-01-31  price    0
            qty      0
            ship     0
2013-02-28  price    2
            qty      2
            ship     2
2013-03-31  price    0
            qty      0
            ship     0
2013-04-30  price    0
            qty      0
            ship     0
2013-05-31  price    3
            qty      3
            ship     3
dtype: int64
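Note that the how= argument shown above was removed in later pandas releases, where the aggregation is chained as a method on the resampler instead. A rough equivalent, assuming a recent pandas and a made-up DataFrame with the same column names, might look like this. It uses the month-start alias 'MS', so the bins are labelled by the first day of the month rather than the last.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
df = pd.DataFrame({
    "recd": pd.to_datetime(["2012-07-24", "2012-09-02",
                            "2012-09-30", "2012-12-23"]),
    "price": [64.15, 70.10, 24.09, 12.35],
    "qty": [9, 3, 4, 6],
})

by_recd = df.set_index("recd")

# min_count=1 leaves empty months as NaN, as in the output above;
# without it, empty months sum to 0.
totals = by_recd.resample("MS").sum(min_count=1)

# Number of non-null values per column in each month.
counts = by_recd.resample("MS").count()
```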
I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this
import numpy as np
import pandas as pd

def dcut(dts, freq='D', right=True):
    hi = pd.Period(dts.max(), freq=freq) + 1  # get first period past end of data
    periods = pd.period_range(start=dts.min(), end=hi, freq=freq)
    # get an array of integer bin boundaries representing ns-since-epoch
    # note the extra period gives us the extra right-hand bin boundary we need
    bounds = periods.to_timestamp(how='start').asi8
    # bin our time field as integers
    cut = pd.cut(pd.DatetimeIndex(dts).asi8, bins=bounds, right=right)
    # relabel the bins using the periods, omitting the extra one at the end
    return cut.rename_categories(periods[:-1].astype(str))
Then we can do what I wanted:
df.groupby([dcut(df.recd, freq='M', right=False), dcut(df.ship, freq='M', right=False)]).count()
To get:
                 price  qty  recd  ship
2012-07 2012-10      1    1     1     1
2012-11 2012-12      1    1     1     1
        2013-03      1    1     1     1
2012-12 2012-09      1    1     1     1
        2013-02      1    1     1     1
2013-01 2012-08      1    1     1     1
2013-02 2013-02      1    1     1     1
2013-03 2013-03      1    1     1     1
2013-04 2012-07      1    1     1     1
        2013-03      1    1     1     1
I guess you could similarly define dqcut() which first "rounds" each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and round the resulting bins based on your chosen frequency?
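A rough sketch of the first variant (round first, then qcut) could look like the following. dqcut is a hypothetical helper, not something pandas provides, and it assumes a pandas version whose qcut accepts datetimes (0.20.0 or later). One caveat: qcut interpolates its quantile edges, so the edges are computed from the rounded values but may still fall between two period starts.

```python
import pandas as pd

def dqcut(dts, q, freq="M"):
    """Hypothetical helper: quantile-bin datetimes after rounding them
    down to the start of their containing period."""
    dts = pd.Series(pd.to_datetime(dts))
    # Round each value down to the start of its period (e.g. its month).
    starts = dts.dt.to_period(freq).dt.start_time
    # qcut derives its edges from the rounded values; duplicates="drop"
    # merges edges that collapse onto the same value.
    return pd.qcut(starts, q=q, duplicates="drop")

# Made-up dates for illustration.
dates = ["2012-07-24", "2012-07-30", "2012-08-15",
         "2012-09-30", "2012-12-23", "2013-02-05"]
halves = dqcut(dates, q=2)
```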
No joy on the bonus question yet? :)
Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().
In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
   ....:             pd.PeriodIndex(df.ship, freq='Q'),
   ....:             pd.cut(df['qty'], bins=[0,5,10]),
   ....:             pd.qcut(df['price'],q=2),
   ....:            ]).size()
Out[49]:
                qty      price
2012Q2  2013Q1  (0, 5]   [2, 5]    1
2012Q3  2013Q1  (5, 10]  [2, 5]    1
2012Q4  2012Q3  (5, 10]  [2, 5]    1
        2013Q1  (0, 5]   [2, 5]    1
                (5, 10]  [2, 5]    1
2013Q1  2012Q3  (0, 5]   (5, 8]    1
        2013Q1  (5, 10]  (5, 8]    2
2013Q2  2012Q4  (0, 5]   (5, 8]    1
        2013Q2  (0, 5]   [2, 5]    1
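For anyone who wants to try this without the original data, here is a small self-contained sketch with a made-up DataFrame (only the column names come from the question). It passes observed=True, assuming pandas 0.23 or later, so that empty category combinations from pd.cut are left out of the result.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
df = pd.DataFrame({
    "recd": pd.to_datetime(["2012-07-24", "2012-08-15",
                            "2012-12-23", "2013-02-05"]),
    "ship": pd.to_datetime(["2012-08-01", "2012-08-20",
                            "2013-01-10", "2013-02-07"]),
    "qty": [9, 7, 6, 3],
})

# Group by received quarter, shipped quarter and quantity bin.
# size() returns one row count per group, whereas count() returns
# a non-null count for every remaining column.
sizes = df.groupby([
    pd.PeriodIndex(df["recd"], freq="Q"),
    pd.PeriodIndex(df["ship"], freq="Q"),
    pd.cut(df["qty"], bins=[0, 5, 10]),
], observed=True).size()
```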