Count Number of Rows GroupBy within a GroupBy Between Two Dates in Pandas Dataframe

落爺英雄遲暮 submitted on 2020-01-03 02:20:11

Question


I have a dataframe df, which can be created with the following code:

import random
from datetime import timedelta
import pandas as pd
import datetime

# create test range of dates
rng = pd.date_range(datetime.date(2015, 7, 15), datetime.date(2015, 7, 31))
rnglist = rng.tolist()
testpts = range(100, 121)
# create test dataframe
d = {'jid': list(testpts),
     'cid': [random.randint(1, 2) for _ in testpts],
     'ctid': [random.randint(3, 4) for _ in testpts],
     # randint is inclusive at both ends, so the upper bound is len(rng) - 1
     'stdt': [rnglist[random.randint(0, len(rng) - 1)] for _ in testpts]}
df = pd.DataFrame(d)[['jid', 'cid', 'ctid', 'stdt']]
# a single random offset is drawn once and applied to every row
df['enddt'] = df['stdt'] + timedelta(days=random.randint(2, 16))

The df looks like this:

      jid  cid  ctid       stdt      enddt
0   100    1     4 2015-07-28 2015-08-11
1   101    2     3 2015-07-31 2015-08-14
2   102    2     3 2015-07-31 2015-08-14
3   103    1     3 2015-07-24 2015-08-07
4   104    2     4 2015-07-27 2015-08-10
5   105    1     4 2015-07-27 2015-08-10
6   106    2     4 2015-07-24 2015-08-07
7   107    2     3 2015-07-22 2015-08-05
8   108    2     3 2015-07-28 2015-08-11
9   109    1     4 2015-07-20 2015-08-03
10  110    2     3 2015-07-29 2015-08-12
11  111    1     3 2015-07-29 2015-08-12
12  112    1     3 2015-07-27 2015-08-10
13  113    1     3 2015-07-21 2015-08-04
14  114    1     4 2015-07-28 2015-08-11
15  115    2     3 2015-07-28 2015-08-11
16  116    1     3 2015-07-26 2015-08-09
17  117    1     3 2015-07-25 2015-08-08
18  118    2     3 2015-07-26 2015-08-09
19  119    2     3 2015-07-19 2015-08-02
20  120    2     3 2015-07-22 2015-08-05

What I need to do is the following: count (cnt) the number of jid values per ctid within each cid, for every date (newdate) between min(stdt) and max(enddt), where newdate falls between that row's stdt and enddt.

The resulting DataFrame should look like this (shown only for cid 1 / ctid 3 using the data above; the same would be produced for cid 1 / ctid 4, cid 2 / ctid 3, and cid 2 / ctid 4):

cid ctid    newdate cnt
1   3   7/21/2015   1
1   3   7/22/2015   1
1   3   7/23/2015   1
1   3   7/24/2015   2
1   3   7/25/2015   3
1   3   7/26/2015   4
1   3   7/27/2015   5
1   3   7/28/2015   5
1   3   7/29/2015   6
1   3   7/30/2015   6
1   3   7/31/2015   6
1   3   8/1/2015    6
1   3   8/2/2015    6
1   3   8/3/2015    6
1   3   8/4/2015    6
1   3   8/5/2015    5
1   3   8/6/2015    5
1   3   8/7/2015    5
1   3   8/8/2015    4
1   3   8/9/2015    3
1   3   8/10/2015   2
1   3   8/11/2015   1
1   3   8/12/2015   1
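
The requirement above can be pinned down with a brute-force reference: expand every job into one row per active day, then count rows per cid, ctid, and day. This is only a sketch on a tiny hypothetical sample (the jobs frame below is not the data from the question), and it is practical only for small frames because it materializes one row per job per day.

```python
import pandas as pd

# Tiny hypothetical sample (not the question's data): three jobs, one cid/ctid pair.
jobs = pd.DataFrame({
    'jid':   [1, 2, 3],
    'cid':   [1, 1, 1],
    'ctid':  [3, 3, 3],
    'stdt':  pd.to_datetime(['2015-07-21', '2015-07-22', '2015-07-24']),
    'enddt': pd.to_datetime(['2015-07-23', '2015-07-25', '2015-07-24']),
})

# One row per (job, day) with the day inside [stdt, enddt], both ends inclusive.
expanded = pd.DataFrame(
    [(row.cid, row.ctid, day)
     for row in jobs.itertuples(index=False)
     for day in pd.date_range(row.stdt, row.enddt)],
    columns=['cid', 'ctid', 'newdate'],
)

# Count the jobs active on each day within each cid/ctid pair.
brute = (expanded.groupby(['cid', 'ctid', 'newdate'])
                 .size()
                 .reset_index(name='cnt'))
print(brute)
```

Days on which no job is active simply do not appear in this output.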

This previous question (which was also mine), Count # of Rows Between Dates, was very similar and was answered using pd.melt. I am pretty sure melt can be used again, or maybe there is a better option, but I can't figure out how to accomplish the 'two-layer groupby' that counts the jid values for each ctid, for each cid, for each newdate. I'd love your input.


Answer 1:


After trying @Scott Boston's answer on a 1.8m-record df, the first line

df_out = pd.concat([pd.DataFrame(index=pd.date_range(df.iloc[i].stdt,df.iloc[i].enddt)).assign(**df.iloc[i,0:3]) for i in pd.np.arange(df.shape[0])]).reset_index()

was still running after 1 hour, and slowly eating away at memory. So I thought I'd try the following:

def reindex_by_date(df):
    # fill in every calendar day between the group's first and last date
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates)

def replace_last_0(group):
    # zero out the change on the group's last date
    group.loc[max(group.index), 'change'] = 0
    return group

def ctidloop(partdf):
    coid = partdf.cid.max()
    cols = ['ctid', 'stdt', 'enddt']
    partdf = partdf[cols].copy()
    partdf['jid'] = partdf.index
    partdf = pd.melt(partdf, id_vars=['ctid', 'jid'], var_name='change', value_name='newdate')
    partdf['change'] = partdf['change'].map({'stdt': 1, 'enddt': -1})
    partdf.newdate = pd.DatetimeIndex(partdf['newdate'])
    partdf = partdf.groupby(['ctid', 'newdate'], as_index=False)['change'].sum()
    partdf = partdf.groupby('ctid').apply(replace_last_0).reset_index(drop=True)
    partdf['cnt'] = partdf.groupby('ctid')['change'].cumsum()
    partdf.index = partdf['newdate']
    cols = ['ctid', 'change', 'cnt', 'newdate']
    partdf = partdf[cols]
    partdf = partdf.groupby('ctid').apply(reindex_by_date).reset_index(0, drop=True)
    partdf['newdate'] = partdf.index
    partdf['ctid'] = partdf['ctid'].ffill()
    partdf.cnt = partdf.cnt.ffill()
    partdf.change = partdf.change.fillna(0)
    partdf['cid'] = coid
    return partdf

gb = df.groupby('cid').apply(ctidloop)

This code returned the correct result in:

%timeit gb=df.groupby('cid').apply(ctidloop)
1 loop, best of 3: 9.74 s per loop 

EXPLANATION: Basically, melt is very quick, so I figured I would just break the first groupby up into groups and run a function on each one. This code takes the df, groups it by cid, and applies the function ctidloop to each group.
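
To illustrate why melt makes this cheap, here is a minimal sketch of the core idea on a tiny hypothetical frame (the jobs data below is made up for illustration): each job becomes two event rows, +1 on its start date and -1 on its end date, and a running sum of those events gives the number of active jobs.

```python
import pandas as pd

# Hypothetical sample: three jobs that all share one ctid.
jobs = pd.DataFrame({
    'jid':   [1, 2, 3],
    'ctid':  [3, 3, 3],
    'stdt':  pd.to_datetime(['2015-07-21', '2015-07-22', '2015-07-24']),
    'enddt': pd.to_datetime(['2015-07-23', '2015-07-25', '2015-07-26']),
})

# Two rows per job: one for its start date, one for its end date.
events = pd.melt(jobs, id_vars=['ctid', 'jid'],
                 var_name='change', value_name='newdate')
# A start adds one active job, an end removes one.
events['change'] = events['change'].map({'stdt': 1, 'enddt': -1})

# Net change per ctid and date, then a running total per ctid.
net = events.groupby(['ctid', 'newdate'], as_index=False)['change'].sum()
net['cnt'] = net.groupby('ctid')['change'].cumsum()
print(net)
```

Note that with this sign convention the count already drops on the end date itself. If end dates should be counted inclusively, one option would be to place the -1 on the day after enddt; that adjustment is an assumption on my part and is not in the code above.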

In ctidloop, the following happens line by line:
1) Grab the cid for future use.
2,3) Establish the core partdf to process by selecting the needed columns.
4) Create jid from the index.
5) Run pd.melt, which flattens the dataframe into one row per jid for stdt and one for enddt.
6) Turn the 'change' column into +1 for stdt and -1 for enddt.
7) Make newdate a DatetimeIndex (just easier for further processing).
8) Group by ctid and newdate, summing the change.
9) Group by ctid again, replacing the last value with 0 (this is just something I needed, not specific to the problem).
10) Create cnt by grouping by ctid and taking the cumulative sum of change.
11) Make newdate the new index.
12,13) Select and order the columns.
14) Group by ctid once more, reindexing each group from its lowest to its highest date to fill the gaps.
15) Assign newdate from the new index values.
16,17,18) Fill the values in the gap rows (I needed this enhancement).
19) Assign cid again from the variable coid gathered in line 1.
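
The gap-filling steps (11 and 14 through 18) are perhaps the least obvious, so here is a small sketch of them for a single ctid. The net frame is hypothetical and stands in for the per-date totals produced by the earlier steps.

```python
import pandas as pd

# Hypothetical per-date totals for one ctid, with gaps between event dates.
net = pd.DataFrame({
    'ctid':    [3, 3, 3],
    'newdate': pd.to_datetime(['2015-07-21', '2015-07-24', '2015-07-26']),
    'change':  [1, 1, -1],
    'cnt':     [1, 2, 1],
})

net = net.set_index('newdate')
# Reindex onto every calendar day between the first and last event date.
full = net.reindex(pd.date_range(net.index.min(), net.index.max()))

# Days without an event carry the previous count and ctid forward, with zero change.
full['cnt'] = full['cnt'].ffill()
full['ctid'] = full['ctid'].ffill()
full['change'] = full['change'].fillna(0)
full['newdate'] = full.index
print(full)
```

Here reindex is called directly on one group; the answer's code does the same thing for every ctid via groupby('ctid').apply(reindex_by_date). The reindexed columns come back as floats because the gap rows are NaN before filling.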

This is done for each cid by the last line of code, gb=df.groupby('cid').apply(ctidloop).
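
As a variation that I have not timed on the 1.8m-record data, the outer apply over cid could probably be avoided by carrying both keys through melt and grouping on cid and ctid together. This is a sketch on hypothetical data, not the code I actually ran, and it leaves out the gap filling and the last-value replacement.

```python
import pandas as pd

# Hypothetical sample covering three cid/ctid pairs.
jobs = pd.DataFrame({
    'jid':   [1, 2, 3, 4],
    'cid':   [1, 1, 2, 2],
    'ctid':  [3, 3, 3, 4],
    'stdt':  pd.to_datetime(['2015-07-21', '2015-07-22', '2015-07-21', '2015-07-23']),
    'enddt': pd.to_datetime(['2015-07-23', '2015-07-24', '2015-07-22', '2015-07-24']),
})

keys = ['cid', 'ctid']
# Keep both grouping keys as id columns so each event row remembers its pair.
events = pd.melt(jobs, id_vars=keys + ['jid'],
                 var_name='change', value_name='newdate')
events['change'] = events['change'].map({'stdt': 1, 'enddt': -1})

# Net change per pair and date, then a running total within each pair.
net = events.groupby(keys + ['newdate'], as_index=False)['change'].sum()
net['cnt'] = net.groupby(keys)['change'].cumsum()
print(net)
```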

Thanks to @Scott Boston for the attempt. It certainly works, but it took too long for me.

Kudos to @DSM for his solution HERE which was the basis of my solution.



Source: https://stackoverflow.com/questions/44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr
