Question
I have a dataframe df, which can be created with the following code:
import random
import datetime
from datetime import timedelta
import pandas as pd

# create test range of dates
rng = pd.date_range(datetime.date(2015, 7, 15), datetime.date(2015, 7, 31))
rnglist = rng.tolist()
testpts = range(100, 121)

# create test dataframe (randint's upper bound is inclusive, so use len(rng) - 1
# to avoid an occasional IndexError)
d = {'jid': [i for i in testpts],
     'cid': [random.randint(1, 2) for _ in testpts],
     'ctid': [random.randint(3, 4) for _ in testpts],
     'stdt': [rnglist[random.randint(0, len(rng) - 1)] for _ in testpts]}
df = pd.DataFrame(d)[['jid', 'cid', 'ctid', 'stdt']]
# note: randint here is evaluated once, so every row gets the same offset
df['enddt'] = df['stdt'] + timedelta(days=random.randint(2, 16))
The df looks like this:
jid cid ctid stdt enddt
0 100 1 4 2015-07-28 2015-08-11
1 101 2 3 2015-07-31 2015-08-14
2 102 2 3 2015-07-31 2015-08-14
3 103 1 3 2015-07-24 2015-08-07
4 104 2 4 2015-07-27 2015-08-10
5 105 1 4 2015-07-27 2015-08-10
6 106 2 4 2015-07-24 2015-08-07
7 107 2 3 2015-07-22 2015-08-05
8 108 2 3 2015-07-28 2015-08-11
9 109 1 4 2015-07-20 2015-08-03
10 110 2 3 2015-07-29 2015-08-12
11 111 1 3 2015-07-29 2015-08-12
12 112 1 3 2015-07-27 2015-08-10
13 113 1 3 2015-07-21 2015-08-04
14 114 1 4 2015-07-28 2015-08-11
15 115 2 3 2015-07-28 2015-08-11
16 116 1 3 2015-07-26 2015-08-09
17 117 1 3 2015-07-25 2015-08-08
18 118 2 3 2015-07-26 2015-08-09
19 119 2 3 2015-07-19 2015-08-02
20 120 2 3 2015-07-22 2015-08-05
What I need to do is the following: count (cnt) the number of jid that occur by ctid by cid, for each date (newdate) between the min(stdt) and the max(enddt), where the newdate falls between the stdt and the enddt.
The resulting DataFrame should look like the following (this is just for one cid with one ctid using the data above; it would be replicated in this case for cid 1/ctid 4, cid 2/ctid 3, and cid 2/ctid 4):
cid ctid newdate cnt
1 3 7/21/2015 1
1 3 7/22/2015 1
1 3 7/23/2015 1
1 3 7/24/2015 2
1 3 7/25/2015 3
1 3 7/26/2015 4
1 3 7/27/2015 5
1 3 7/28/2015 5
1 3 7/29/2015 6
1 3 7/30/2015 6
1 3 7/31/2015 6
1 3 8/1/2015 6
1 3 8/2/2015 6
1 3 8/3/2015 6
1 3 8/4/2015 6
1 3 8/5/2015 5
1 3 8/6/2015 5
1 3 8/7/2015 5
1 3 8/8/2015 4
1 3 8/9/2015 3
1 3 8/10/2015 2
1 3 8/11/2015 1
1 3 8/12/2015 1
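One way to picture the target computation: expand each row into the dates it spans, then count rows per (cid, ctid, date). A minimal sketch of that idea on a two-row toy frame (my own illustration using DataFrame.explode, not code from the question):

```python
import pandas as pd

# Toy data: two jobs in the same (cid, ctid) group with overlapping spans.
toy = pd.DataFrame({
    'jid':  [1, 2],
    'cid':  [1, 1],
    'ctid': [3, 3],
    'stdt':  pd.to_datetime(['2015-07-01', '2015-07-03']),
    'enddt': pd.to_datetime(['2015-07-04', '2015-07-05']),
})

# One row per active date, then count jids per group per date.
toy['newdate'] = [pd.date_range(s, e) for s, e in zip(toy['stdt'], toy['enddt'])]
out = (toy.explode('newdate')
          .groupby(['cid', 'ctid', 'newdate'])
          .size()
          .reset_index(name='cnt'))
```

Here jid 1 is active 7/1–7/4 and jid 2 is active 7/3–7/5, so the counts per date come out as 1, 1, 2, 2, 1.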
This previous question of mine, Count # of Rows Between Dates, was very similar and was answered using pd.melt. I am pretty sure melt can be used again, or maybe there is a better option, but I can't figure out how to accomplish the 'two-layer groupby' that counts the size of jid for each ctid, for each cid, for each newdate. Love your inputs...
Answer 1:
After trying @Scott Boston's answer on a 1.8M-record df, the first line
df_out = pd.concat([pd.DataFrame(index=pd.date_range(df.iloc[i].stdt,df.iloc[i].enddt)).assign(**df.iloc[i,0:3]) for i in pd.np.arange(df.shape[0])]).reset_index()
was still running after an hour and slowly eating away at memory. So I thought I'd try the following:
def reindex_by_date(df):
dates = pd.date_range(df.index.min(), df.index.max())
return df.reindex(dates)
def replace_last_0(group):
group.loc[max(group.index),'change']=0
return group
def ctidloop(partdf):
coid=partdf.cid.max()
cols=['ctid', 'stdt', 'enddt']
partdf=partdf[cols]
partdf['jid']=partdf.index
partdf = pd.melt(partdf, id_vars=['ctid', 'jid'],var_name='change', value_name='newdate')
partdf['change'] = partdf['change'].replace({'stdt': 1, 'enddt': -1})
partdf.newdate=pd.DatetimeIndex(partdf['newdate'])
partdf=partdf.groupby(['ctid', 'newdate'],as_index=False)['change'].sum()
partdf=partdf.groupby('ctid').apply(replace_last_0).reset_index(drop=True)
partdf['cnt'] = partdf.groupby('ctid')['change'].cumsum()
partdf.index=partdf['newdate']
cols=['ctid', 'change', 'cnt', 'newdate']
partdf=partdf[cols]
partdf=partdf.groupby('ctid').apply(reindex_by_date).reset_index(0, drop=True)
partdf['newdate']=partdf.index
partdf['ctid']=partdf['ctid'].fillna(method='ffill')
partdf.cnt=partdf.cnt.fillna(method='ffill')
partdf.change=partdf.change.fillna(0)
partdf['cid']=coid
return partdf
gb=df.groupby('cid').apply(ctidloop)
This code returned the correct result in:
%timeit gb=df.groupby('cid').apply(ctidloop)
1 loop, best of 3: 9.74 s per loop
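As an aside, much of the cost of that first one-liner comes from calling .iloc once per row; the same row-by-row expansion can be sketched with itertuples instead, which avoids that overhead (my own illustration, not the asker's or @Scott Boston's exact code):

```python
import pandas as pd

df = pd.DataFrame({
    'jid':  [100, 101],
    'cid':  [1, 2],
    'ctid': [4, 3],
    'stdt':  pd.to_datetime(['2015-07-01', '2015-07-02']),
    'enddt': pd.to_datetime(['2015-07-03', '2015-07-05']),
})

# Build one small frame per source row; itertuples yields plain tuples,
# which is far cheaper than repeated positional .iloc lookups.
pieces = [
    pd.DataFrame({'jid': r.jid, 'cid': r.cid, 'ctid': r.ctid,
                  'newdate': pd.date_range(r.stdt, r.enddt)})
    for r in df.itertuples(index=False)
]
expanded = pd.concat(pieces, ignore_index=True)
```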
EXPLANATION: Basically, melt is very quick, so I figured I'd just break the first groupby up into groups and run a function on each. This code takes the df, groups by cid, and applies the function ctidloop.
In ctidloop, the following happens line by line:
1) Grab the cid for future use.
2,3) Establish the core partdf to process by selecting the needed columns.
4) Create jid from the index.
5) Run pd.melt, which flattens the dataframe by creating a row for each jid for stdt and enddt.
6) Create a change column, which assigns +1 to stdt and -1 to enddt.
7) Make newdate a DatetimeIndex (just easier for further processing).
8) Group what we have by ctid and newdate, summing the change.
9) Group by ctid again, replacing the last value with 0 (this is just something I needed, not specific to the problem).
10) Create cnt by grouping by ctid and cumulatively summing the change.
11) Make the new index from newdate.
12,13) Format columns/names.
14) Another groupby on ctid, but reindexing between the low and high dates, filling the gaps.
15) Assign newdate from the reindexed values.
16,17,18) Fill various values to close the gaps (I needed this enhancement).
19) Assign cid again from the variable coid gathered in line 1.
Do this for each cid via the last line of code, gb=df.groupby('cid').apply(ctidloop).
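The core of steps 5–10 — melt the two date columns into ±1 events and take a cumulative sum — can be seen in isolation on a tiny frame (my own sketch; it uses map where the answer uses replace, same effect):

```python
import pandas as pd

toy = pd.DataFrame({
    'jid':  [1, 2],
    'stdt':  pd.to_datetime(['2015-07-01', '2015-07-03']),
    'enddt': pd.to_datetime(['2015-07-04', '2015-07-05']),
})

# Melt start/end into one event column: each stdt is a +1 event,
# each enddt a -1 event.
events = pd.melt(toy, id_vars=['jid'], var_name='change', value_name='newdate')
events['change'] = events['change'].map({'stdt': 1, 'enddt': -1})

# Net change per date, then a cumulative sum gives the running count.
cnt = events.groupby('newdate')['change'].sum().cumsum()
```

With jid 1 spanning 7/1–7/4 and jid 2 spanning 7/3–7/5, the running count over the event dates is 1, 2, 1, 0; the full solution then reindexes over the complete date range and forward-fills to cover the gaps.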
Thanks @Scott Boston for the attempt; I'm sure it works, but it took too long for me.
Kudos to @DSM for his solution HERE, which was the basis of my solution.
Source: https://stackoverflow.com/questions/44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr