问题
I am researching prescription habits and have large dataframes of sold products.
I am trying to transform purchases of medications into courses of the drugs by calculating how long the product would have lasted and adding a 5 day fudge factor for compliance, starting delays, etc to calculate an end date for the purchase.
I then want to combine prescriptions with overlapping date windows but I'm struggling to find an efficient way to do this. I was hoping a groupby would be possible but I can't figure out how to do this.
I know how to iterate over the dataframe to create a new dataframe with the relevant information, but it is a slow operation and I am hoping I can find a more elegant solution.
ID start end ingredient days dose end
1000 2018-10-03 2018-10-18 Metron... 10.0 125.00
1000 2018-10-13 2018-10-25 Metron... 7.0 125.00
1001 2018-03-08 2018-03-20 Cefalexin 7.0 150.00
1001 2018-09-17 2018-10-05 Cefalexin 13.0 150.00
1002 2018-05-18 2018-05-30 Amoxiclav 7.0 75.00
1002 2018-05-25 2018-06-06 Amoxiclav 7.0 100.00
1003 2018-07-01 2018-07-16 Amoxiclav 10.0 50.00
1003 2018-07-15 2018-07-30 Amoxiclav 10.0 50.00
1003 2018-07-25 2018-08-09 Amoxiclav 10.0 50.00
My expected result is as follows:
ID start end ingredient days dose
1000 2018-10-03 2018-10-25 Metron... 17.0 125.00
1001 2018-03-08 2018-03-20 Cefalexin 7.0 150.00
1001 2018-09-17 2018-10-05 Cefalexin 13.0 150.00
1002 2018-05-18 2018-05-30 Amoxiclav 7.0 75.00
1002 2018-05-25 2018-06-06 Amoxiclav 7.0 100.00
1003 2018-07-01 2018-08-05 Amoxiclav 30.0 50.00
1000
's second purchase was exactly 10 days in so the end date is the same as their second end date.
1001
did not overlap so remains as they are.
1002
overlaps on start and end dates but had a change in their dose so should not be combined.
1003
had 30 days worth in total. The start date of their final purchase is later than the end date of the first. Their end date should be 35 days after they first made a purchase. This is a negotiable criterion and an end date matching the final purchase's end date would be acceptable.
Am I barking up the wrong tree here? Must this be done iteratively?
回答1:
I think the biggest problem here is to identify when the time intervals are overlapping, the rest is just grouping and addition.
First, be sure (if not already done) to convert your dates to datetime
and days in timedelta
. This will help to compare dates and durations and perform some math on them.
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['days'] = pd.to_timedelta(df['days'], unit='D')
This code produces your expected result:
def join_times(x):
startdf = pd.DataFrame({'time':x['start'], 'what':1})
enddf = pd.DataFrame({'time':x['end'], 'what':-1})
mergdf = pd.concat([startdf, enddf]).sort_values('time')
mergdf['running'] = mergdf['what'].cumsum()
mergdf['newwin'] = mergdf['running'].eq(1) & mergdf['what'].eq(1)
mergdf['group'] = mergdf['newwin'].cumsum()
x['group'] = mergdf['group'].loc[mergdf['what'].eq(1)]
res = x.groupby('group').agg({'days':'sum', 'start':'first'})
res['end'] = res.apply(lambda x : x['start'] + x['days'] + pd.to_timedelta(5, unit='D'), axis=1)
return res
ddf = df.groupby(['ID', 'ingredient', 'dose']).apply(join_times).reset_index().drop('group', axis=1)
This needs to be explained. As you see, I use groupby
to identify the subsamples. Then the job is done by the custom join_times
function.
join_times
function joins together in the same column of a single dataframe (column 'time'
) start and end times, sorted in order.
A second column 'what'
marks with +1 starting times, and with -1 ending time. These are used to keep track of how many intervals are overlapping (in colum 'running'
using cumsum()
).
Then a boolean column 'newwin'
is buildt to identify the beginning of a new non overlapping time interval and a column 'group'
is buildt to mark with the same integer the rows belonging to the same overlapping time interval.
The a 'group'
column is added to the original subsample, copying the values in the previously buildt 'group'
column. Finally, we can identify, for each subsample, which rows have overlapping.
So we can use groupby
again and sum the 'days'
column, keeping the first date from 'start'
column.'end'
column is calculated by adding to 'start'
the duration 'days'
plus 5 days.
The above code, using your data sample, gives:
ID ingredient dose days start end
0 1000 Metron... 125.0 17 days 2018-10-03 2018-10-25
1 1001 Cefalexin 150.0 7 days 2018-03-08 2018-03-20
2 1001 Cefalexin 150.0 13 days 2018-09-17 2018-10-05
3 1002 Amoxiclav 75.0 7 days 2018-05-18 2018-05-30
4 1002 Amoxiclav 100.0 7 days 2018-05-25 2018-06-06
5 1003 Amoxiclav 50.0 30 days 2018-07-01 2018-08-05
Which is your expected result. Column order is different due to groupby
operations with indexing.
来源:https://stackoverflow.com/questions/57804145/combining-rows-with-overlapping-time-periods-in-a-pandas-dataframe