What's the equivalent of cut/qcut for pandas date fields?

Happy的楠姐 2020-12-31 20:13

Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.

pd.cut and pd.qcut now sup
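
As a rough illustration of that update, the sketch below bins a datetime column directly with pd.cut and pd.qcut. Only the column name `recd` comes from the question; the sample dates and the `edges` boundaries are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: the name `recd` follows the question, the dates are invented.
recd = pd.Series(pd.to_datetime([
    "2012-07-09", "2012-08-15", "2012-09-03", "2012-09-28",
    "2012-12-11", "2013-02-05", "2013-02-20", "2013-05-17",
]))

# Explicit datetime bin edges (assumed quarter boundaries), left-closed bins.
edges = pd.to_datetime(
    ["2012-07-01", "2012-10-01", "2013-01-01", "2013-04-01", "2013-07-01"]
)
by_quarter = pd.cut(recd, bins=edges, right=False)

# Quantile-based bins: two halves with roughly equal numbers of dates.
halves = pd.qcut(recd, q=2)

print(by_quarter.value_counts(sort=False))
print(halves.value_counts(sort=False))
```

Both calls return a categorical Series whose categories are intervals of timestamps, so the result can be passed straight to groupby.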

4 answers
  • 2020-12-31 20:43

    How about putting the parts of the DataFrame that you're interested in into a Series, then calling cut on the Series object?

    price_series = pd.Series(df.price.tolist(), index=df.recd)
    

    and then

     pd.qcut(price_series, q=3)
    

    and so on. (Though I think @Jeff's answer is best)

  • 2020-12-31 20:52

    You just need to set the field you'd like to resample by as the index. Here are some examples:

    In [36]: df.set_index('recd').resample('1M',how='sum')
    Out[36]: 
                     price  qty
    recd                       
    2012-07-31   64.151194    9
    2012-08-31   93.476665    7
    2012-09-30   94.193027    7
    2012-10-31         NaN  NaN
    2012-11-30         NaN  NaN
    2012-12-31   12.353405    6
    2013-01-31         NaN  NaN
    2013-02-28  129.586697    7
    2013-03-31         NaN  NaN
    2013-04-30         NaN  NaN
    2013-05-31  211.979583   13
    
    In [37]: df.set_index('recd').resample('1M',how='count')
    Out[37]: 
    2012-07-31  price    1
                qty      1
                ship     1
    2012-08-31  price    1
                qty      1
                ship     1
    2012-09-30  price    2
                qty      2
                ship     2
    2012-10-31  price    0
                qty      0
                ship     0
    2012-11-30  price    0
                qty      0
                ship     0
    2012-12-31  price    1
                qty      1
                ship     1
    2013-01-31  price    0
                qty      0
                ship     0
    2013-02-28  price    2
                qty      2
                ship     2
    2013-03-31  price    0
                qty      0
                ship     0
    2013-04-30  price    0
                qty      0
                ship     0
    2013-05-31  price    3
                qty      3
                ship     3
    dtype: int64
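
The `how=` keyword used above was deprecated and later removed from resample; in recent pandas the aggregation is chained as a method call instead. Here is a minimal sketch of the same two calls, assuming made-up data under the question's column names. It uses the 'MS' (month start) alias because the month-end alias was renamed from 'M' to 'ME' in pandas 2.2, so bins are labelled by the first day of each month rather than the last.

```python
import pandas as pd

# Hypothetical data: column names follow the question, values are invented.
df = pd.DataFrame({
    "recd": pd.to_datetime(
        ["2012-07-09", "2012-08-15", "2012-09-03", "2012-09-28", "2012-12-11"]
    ),
    "price": [64.15, 93.48, 40.00, 54.19, 12.35],
    "qty": [9, 7, 3, 4, 6],
})

# One resampler object, reused for both aggregations.
monthly = df.set_index("recd").resample("MS")

totals = monthly.sum()    # replaces resample(..., how='sum')
counts = monthly.count()  # replaces resample(..., how='count')

print(totals)
print(counts)
```

In recent versions empty months sum to 0 rather than NaN; passing `min_count=1` to `sum()` should restore the NaN behaviour seen in the output above.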
    
  • 2020-12-31 20:59

    I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this:

    import pandas as pd

    def dcut(dts, freq='D', right=True):
        dts = pd.DatetimeIndex(dts)
        hi = pd.Period(dts.max(), freq=freq) + 1   # get first period past end of data
        # pd.PeriodIndex(start=..., end=...) was removed in later pandas releases;
        # pd.period_range builds the same range
        periods = pd.period_range(start=dts.min(), end=hi, freq=freq)
        # get integer bin boundaries representing ns-since-epoch
        # note the extra period gives us the extra right-hand bin boundary we need
        bounds = periods.to_timestamp(how='start').asi8
        # bin our time field as integers and label the bins with the periods,
        # omitting the extra one at the end (replaces assigning to cut.levels)
        return pd.cut(dts.asi8, bins=bounds, right=right,
                      labels=periods[:-1].astype(str))
    

    Then we can do what I wanted:

    df.groupby([dcut(df.recd, freq='M', right=False),dcut(df.ship, freq='M', right=False)]).count()
    

    To get:

                    price qty recd ship
    2012-07 2012-10   1    1    1    1
    2012-11 2012-12   1    1    1    1
            2013-03   1    1    1    1  
    2012-12 2012-09   1    1    1    1
            2013-02   1    1    1    1  
    2013-01 2012-08   1    1    1    1
    2013-02 2013-02   1    1    1    1
    2013-03 2013-03   1    1    1    1
    2013-04 2012-07   1    1    1    1
            2013-03   1    1    1    1  
    

    I guess you could similarly define dqcut() which first "rounds" each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and round the resulting bins based on your chosen frequency?
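
The first of those two ideas could be sketched roughly as follows. `dqcut` is a hypothetical helper, not a pandas function, and the sample dates are invented. It also assumes a pandas version in which qcut accepts datetime values (0.20.0 or later, per the update in the question), so it skips the integer conversion used in dcut().

```python
import pandas as pd

def dqcut(dts, q, freq="M"):
    """Hypothetical helper: round dates to period starts, then quantile-bin them."""
    dts = pd.Series(pd.to_datetime(dts))
    # "Round" each date down to the start of its containing period.
    rounded = dts.dt.to_period(freq).dt.to_timestamp()
    # Rounding creates ties, so duplicate quantile edges are dropped, not raised on.
    return pd.qcut(rounded, q=q, duplicates="drop")

# Invented sample dates, for illustration only.
dates = ["2012-07-09", "2012-08-15", "2012-09-03", "2012-09-28",
         "2012-12-11", "2013-02-05", "2013-02-20", "2013-05-17"]
halves = dqcut(dates, q=2)
print(halves.value_counts(sort=False))
```

One caveat with this sketch: qcut interpolates between values, so an interior bin edge can still fall between two period starts rather than exactly on one.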

    No joy on the bonus question yet? :)

  • 2020-12-31 21:10

    Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().

    In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
       ....:             pd.PeriodIndex(df.ship, freq='Q'),
       ....:             pd.cut(df['qty'], bins=[0,5,10]),
       ....:             pd.qcut(df['price'],q=2),
       ....:            ]).size()
    Out[49]: 
                    qty      price 
    2012Q2  2013Q1  (0, 5]   [2, 5]    1
    2012Q3  2013Q1  (5, 10]  [2, 5]    1
    2012Q4  2012Q3  (5, 10]  [2, 5]    1
            2013Q1  (0, 5]   [2, 5]    1
                    (5, 10]  [2, 5]    1
    2013Q1  2012Q3  (0, 5]   (5, 8]    1
            2013Q1  (5, 10]  (5, 8]    2
    2013Q2  2012Q4  (0, 5]   (5, 8]    1
            2013Q2  (0, 5]   [2, 5]    1
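
For anyone who wants to try this without the original data, here is a self-contained sketch of the same pattern. The DataFrame contents are invented; only the column names come from the question. It uses `Series.dt.to_period` in place of the PeriodIndex constructor, and passes `observed=True` so that only bin combinations that actually occur are reported.

```python
import pandas as pd

# Hypothetical data: column names follow the question, values are invented.
df = pd.DataFrame({
    "recd": pd.to_datetime(
        ["2012-07-09", "2012-08-15", "2012-12-11", "2013-02-05", "2013-02-20"]
    ),
    "ship": pd.to_datetime(
        ["2012-10-02", "2012-12-20", "2013-03-01", "2013-02-27", "2013-03-15"]
    ),
    "qty": [9, 7, 6, 3, 4],
})

counts = df.groupby(
    [
        df["recd"].dt.to_period("Q").rename("recd_q"),  # quarter received
        df["ship"].dt.to_period("Q").rename("ship_q"),  # quarter shipped
        pd.cut(df["qty"], bins=[0, 5, 10]),             # quantity bins
    ],
    observed=True,
).size()

print(counts)
```

`.size()` counts rows per group regardless of missing values, which is why it fits the bonus question better than `.count()`.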
    