Create a Pandas dataframe with counts of items spanning a date range

野的像风 2021-01-13 13:35

I have a DataFrame with two date columns of interest. It looks roughly like this:

LIST_DATE     END_DATE
2000-04-18    2000-05-17 00:00:00
2000-05-18    2000-09-18 00:00:00


        
1 Answer
  •  旧巷少年郎
    2021-01-13 14:27

    Here's one way to do it. First, convert each date column to monthly periods (using the Timestamp method to_period) and count how many rows fall in each month with value_counts:

    In [11]: p = pd.period_range(start='2000-01', periods=18, freq='M')
    
    In [12]: starts = df['LIST_DATE'].apply(lambda t: t.to_period(freq='M')).value_counts()
    
    In [13]: ends = df['END_DATE'].apply(lambda t: t.to_period(freq='M')).value_counts()
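
    As a self-contained sketch of this counting step: the first two rows below come from the question, while the third row is hypothetical, chosen only so the counts match the outputs shown in this answer. It assumes both columns are already datetime64 and uses the vectorized `.dt.to_period` accessor instead of `apply`.

```python
import pandas as pd

# Rows 1-2 are from the question; row 3 is a hypothetical extra row.
df = pd.DataFrame({
    "LIST_DATE": pd.to_datetime(["2000-04-18", "2000-05-18", "2000-04-20"]),
    "END_DATE": pd.to_datetime(["2000-05-17", "2000-09-18", "2001-06-15"]),
})

# Map every timestamp to its calendar month, then count rows per month.
starts = df["LIST_DATE"].dt.to_period("M").value_counts()
ends = df["END_DATE"].dt.to_period("M").value_counts()

# value_counts sorts by count, so sort by month for readability.
print(starts.sort_index())
print(ends.sort_index())
```

    Months in which nothing starts or ends are simply missing from these Series, which is why the next step reindexes them.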
    

    Reindex both by the PeriodIndex, fill in the NaNs with 0 (so you can subtract), and subtract the cumulative ends from the cumulative starts to get the number currently active:

    In [14]: starts.reindex(p).fillna(0).cumsum() - ends.reindex(p).fillna(0).cumsum()
    Out[14]: 
    2000-01    0
    2000-02    0
    2000-03    0
    2000-04    2
    2000-05    2
    2000-06    2
    2000-07    2
    2000-08    2
    2000-09    1
    2000-10    1
    2000-11    1
    2000-12    1
    2001-01    1
    2001-02    1
    2001-03    1
    2001-04    1
    2001-05    1
    2001-06    0
    Freq: M, dtype: float64
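
    The whole calculation can be wrapped in a small function. This is only a sketch: `active_per_month` is a hypothetical helper name, the third data row is invented to reproduce the output above, and `reindex(..., fill_value=0)` is used in place of `fillna(0)` so the result stays integer.

```python
import pandas as pd

def active_per_month(df, start_col, end_col, months):
    """Hypothetical helper: cumulative starts minus cumulative ends per month."""
    starts = df[start_col].dt.to_period("M").value_counts()
    ends = df[end_col].dt.to_period("M").value_counts()
    # Months with no events get 0 so the running totals line up.
    started = starts.reindex(months, fill_value=0).cumsum()
    ended = ends.reindex(months, fill_value=0).cumsum()
    return started - ended

# Rows 1-2 are from the question; row 3 is a hypothetical extra row.
df = pd.DataFrame({
    "LIST_DATE": pd.to_datetime(["2000-04-18", "2000-05-18", "2000-04-20"]),
    "END_DATE": pd.to_datetime(["2000-05-17", "2000-09-18", "2001-06-15"]),
})
months = pd.period_range(start="2000-01", periods=18, freq="M")

active = active_per_month(df, "LIST_DATE", "END_DATE", months)
print(active)
```

    Two caveats: an item counts as inactive from the month in which it ends, and any start or end that falls outside `months` is dropped by the reindex, so the range should begin no later than the earliest LIST_DATE.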
    

    An alternative final step is to create a DataFrame that initially tracks the monthly changes, so starts are positive and ends are negative:

    In [21]: current = pd.DataFrame({'starts': starts, 'ends': -ends}, index=p)
    
    In [22]: current
    Out[22]:
             ends  starts
    2000-01   NaN     NaN
    2000-02   NaN     NaN
    2000-03   NaN     NaN
    2000-04   NaN       2
    2000-05    -1       1
    2000-06   NaN     NaN
    2000-07   NaN     NaN
    2000-08   NaN     NaN
    2000-09    -1     NaN
    2000-10   NaN     NaN
    2000-11   NaN     NaN
    2000-12   NaN     NaN
    2001-01   NaN     NaN
    2001-02   NaN     NaN
    2001-03   NaN     NaN
    2001-04   NaN     NaN
    2001-05   NaN     NaN
    2001-06    -1     NaN
    
    In [23]: current.fillna(0)
    Out[23]:
             ends  starts
    2000-01     0       0
    2000-02     0       0
    2000-03     0       0
    2000-04     0       2
    2000-05    -1       1
    2000-06     0       0
    2000-07     0       0
    2000-08     0       0
    2000-09    -1       0
    2000-10     0       0
    2000-11     0       0
    2000-12     0       0
    2001-01     0       0
    2001-02     0       0
    2001-03     0       0
    2001-04     0       0
    2001-05     0       0
    2001-06    -1       0 
    

    The cumsum tracks the running totals of starts and ends up to each month:

    In [24]: current.fillna(0).cumsum()
    Out[24]:
             ends  starts
    2000-01     0       0
    2000-02     0       0
    2000-03     0       0
    2000-04     0       2
    2000-05    -1       3
    2000-06    -1       3
    2000-07    -1       3
    2000-08    -1       3
    2000-09    -2       3
    2000-10    -2       3
    2000-11    -2       3
    2000-12    -2       3
    2001-01    -2       3
    2001-02    -2       3
    2001-03    -2       3
    2001-04    -2       3
    2001-05    -2       3
    2001-06    -3       3 
    

    Summing these columns across each row gives the number currently active, which is the same result as above:

    In [25]: current.fillna(0).cumsum().sum(axis=1)
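
    Put together, the DataFrame variant might look like the sketch below. The third data row is again hypothetical, and recent pandas versions keep the dict's column order (starts, ends) rather than the alphabetical order shown in the outputs above.

```python
import pandas as pd

# Rows 1-2 are from the question; row 3 is a hypothetical extra row.
df = pd.DataFrame({
    "LIST_DATE": pd.to_datetime(["2000-04-18", "2000-05-18", "2000-04-20"]),
    "END_DATE": pd.to_datetime(["2000-05-17", "2000-09-18", "2001-06-15"]),
})
months = pd.period_range(start="2000-01", periods=18, freq="M")

starts = df["LIST_DATE"].dt.to_period("M").value_counts()
ends = df["END_DATE"].dt.to_period("M").value_counts()

# One row per month: +n for items starting, -n for items ending, 0 otherwise.
changes = pd.DataFrame({"starts": starts, "ends": -ends}, index=months).fillna(0)

# Running totals per column, then add the columns to get the active count.
active = changes.cumsum().sum(axis=1)
print(active)
```

    Keeping the signed changes in one frame makes it easy to inspect the per-month starts and ends alongside the running total.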
    
