Question
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {'name' : ['Khan','Khan','Khan','Dean','Dean','Dean'],
     'start_date' : ['01-01-2020','04-02-2020','02-03-2020','09-04-2020','06-08-2020','12-12-2020'],
     'end_date' : ['03-01-2020', '09-02-2020','02-03-2020','15-05-2020','19-08-2020','31-12-2020'],
     'holiday_type' : ['holiday','holiday','sick leave','holiday','holiday','sick leave']
    })
# Parse both date columns as day-month-year
df[['start_date','end_date']] = df[['start_date','end_date']].apply(pd.to_datetime,format='%d-%m-%Y')
print(df)
name start_date end_date holiday_type
0 Khan 2020-01-01 2020-01-03 holiday
1 Khan 2020-02-04 2020-02-09 holiday
2 Khan 2020-03-02 2020-03-02 sick leave
3 Dean 2020-04-09 2020-05-15 holiday
4 Dean 2020-08-06 2020-08-19 holiday
5 Dean 2020-12-12 2020-12-31 sick leave
I'm trying to resample the data by each start and end date without overlapping, i.e. I don't want a single range for Khan that starts on 2020-01-02 and ends on 2020-03-02.
My own attempt has been to melt, set the index, and use groupby.resample. However, I'm unsure how to group by each start/end pair only (a cumulative count, perhaps?), and it doesn't seem very efficient either.
Ideal output:
name date_range holiday_type
0 Khan 2020-01-01 holiday
1 Khan 2020-02-01 holiday
2 Khan 2020-03-01 holiday # < end of holiday 1
3 Khan 2020-04-02 holiday
4 Khan 2020-05-02 holiday
5 Khan 2020-06-02 holiday
6 Khan 2020-07-02 holiday
7 Khan 2020-08-02 holiday
8 Khan 2020-09-02 holiday # end of holiday 2
9 Khan 2020-02-03 sick leave # one day sick leave, can also have one day holiday.
10 Dean 2020-09-04 holiday
11 Dean 2020-10-04 holiday
12 Dean 2020-11-04 holiday
13 Dean 2020-12-04 holiday
14 Dean 2020-04-13 holiday
15 Dean 2020-04-14 holiday
16 Dean 2020-04-15 holiday
17 Dean 2020-06-08 holiday
18 Dean 2020-07-08 holiday
19 Dean 2020-08-08 holiday
20 Dean 2020-09-08 holiday
21 Dean 2020-10-08 holiday
22 Dean 2020-11-08 holiday
23 Dean 2020-12-08 holiday
24 Dean 2020-08-13 holiday
25 Dean 2020-08-14 holiday
26 Dean 2020-08-15 holiday
27 Dean 2020-08-16 holiday
28 Dean 2020-08-17 holiday
29 Dean 2020-08-18 holiday
30 Dean 2020-08-19 holiday
31 Dean 2020-12-12 sick leave
32 Dean 2020-12-13 sick leave
33 Dean 2020-12-14 sick leave
34 Dean 2020-12-15 sick leave
35 Dean 2020-12-16 sick leave
36 Dean 2020-12-17 sick leave
37 Dean 2020-12-18 sick leave
38 Dean 2020-12-19 sick leave
39 Dean 2020-12-20 sick leave
40 Dean 2020-12-21 sick leave
41 Dean 2020-12-22 sick leave
42 Dean 2020-12-23 sick leave
43 Dean 2020-12-24 sick leave
44 Dean 2020-12-25 sick leave
45 Dean 2020-12-26 sick leave
46 Dean 2020-12-27 sick leave
47 Dean 2020-12-28 sick leave
48 Dean 2020-12-29 sick leave
49 Dean 2020-12-30 sick leave
50 Dean 2020-12-31 sick leave
Ideal output as a dict:
{'name': {0: 'Khan', 1: 'Khan', 2: 'Khan', 3: 'Khan', 4: 'Khan', 5: 'Khan', 6: 'Khan', 7: 'Khan', 8: 'Khan', 9: 'Khan', 10: 'Dean', 11: 'Dean', 12: 'Dean', 13: 'Dean', 14: 'Dean', 15: 'Dean', 16: 'Dean', 17: 'Dean', 18: 'Dean', 19: 'Dean', 20: 'Dean', 21: 'Dean', 22: 'Dean', 23: 'Dean', 24: 'Dean', 25: 'Dean', 26: 'Dean', 27: 'Dean', 28: 'Dean', 29: 'Dean', 30: 'Dean', 31: 'Dean', 32: 'Dean', 33: 'Dean', 34: 'Dean', 35: 'Dean', 36: 'Dean', 37: 'Dean', 38: 'Dean', 39: 'Dean', 40: 'Dean', 41: 'Dean', 42: 'Dean', 43: 'Dean', 44: 'Dean', 45: 'Dean', 46: 'Dean', 47: 'Dean', 48: 'Dean', 49: 'Dean', 50: 'Dean'}, 'date_range': {0: Timestamp('2020-01-01 00:00:00'), 1: Timestamp('2020-02-01 00:00:00'), 2: Timestamp('2020-03-01 00:00:00'), 3: Timestamp('2020-04-02 00:00:00'), 4: Timestamp('2020-05-02 00:00:00'), 5: Timestamp('2020-06-02 00:00:00'), 6: Timestamp('2020-07-02 00:00:00'), 7: Timestamp('2020-08-02 00:00:00'), 8: Timestamp('2020-09-02 00:00:00'), 9: Timestamp('2020-02-03 00:00:00'), 10: Timestamp('2020-09-04 00:00:00'), 11: Timestamp('2020-10-04 00:00:00'), 12: Timestamp('2020-11-04 00:00:00'), 13: Timestamp('2020-12-04 00:00:00'), 14: Timestamp('2020-04-13 00:00:00'), 15: Timestamp('2020-04-14 00:00:00'), 16: Timestamp('2020-04-15 00:00:00'), 17: Timestamp('2020-06-08 00:00:00'), 18: Timestamp('2020-07-08 00:00:00'), 19: Timestamp('2020-08-08 00:00:00'), 20: Timestamp('2020-09-08 00:00:00'), 21: Timestamp('2020-10-08 00:00:00'), 22: Timestamp('2020-11-08 00:00:00'), 23: Timestamp('2020-12-08 00:00:00'), 24: Timestamp('2020-08-13 00:00:00'), 25: Timestamp('2020-08-14 00:00:00'), 26: Timestamp('2020-08-15 00:00:00'), 27: Timestamp('2020-08-16 00:00:00'), 28: Timestamp('2020-08-17 00:00:00'), 29: Timestamp('2020-08-18 00:00:00'), 30: Timestamp('2020-08-19 00:00:00'), 31: Timestamp('2020-12-12 00:00:00'), 32: Timestamp('2020-12-13 00:00:00'), 33: Timestamp('2020-12-14 00:00:00'), 34: Timestamp('2020-12-15 00:00:00'), 35: Timestamp('2020-12-16 00:00:00'), 36: 
Timestamp('2020-12-17 00:00:00'), 37: Timestamp('2020-12-18 00:00:00'), 38: Timestamp('2020-12-19 00:00:00'), 39: Timestamp('2020-12-20 00:00:00'), 40: Timestamp('2020-12-21 00:00:00'), 41: Timestamp('2020-12-22 00:00:00'), 42: Timestamp('2020-12-23 00:00:00'), 43: Timestamp('2020-12-24 00:00:00'), 44: Timestamp('2020-12-25 00:00:00'), 45: Timestamp('2020-12-26 00:00:00'), 46: Timestamp('2020-12-27 00:00:00'), 47: Timestamp('2020-12-28 00:00:00'), 48: Timestamp('2020-12-29 00:00:00'), 49: Timestamp('2020-12-30 00:00:00'), 50: Timestamp('2020-12-31 00:00:00')}, 'holiday_type': {0: 'holiday', 1: 'holiday', 2: 'holiday', 3: 'holiday', 4: 'holiday', 5: 'holiday', 6: 'holiday', 7: 'holiday', 8: 'holiday', 9: 'sick leave', 10: 'holiday', 11: 'holiday', 12: 'holiday', 13: 'holiday', 14: 'holiday', 15: 'holiday', 16: 'holiday', 17: 'holiday', 18: 'holiday', 19: 'holiday', 20: 'holiday', 21: 'holiday', 22: 'holiday', 23: 'holiday', 24: 'holiday', 25: 'holiday', 26: 'holiday', 27: 'holiday', 28: 'holiday', 29: 'holiday', 30: 'holiday', 31: 'sick leave', 32: 'sick leave', 33: 'sick leave', 34: 'sick leave', 35: 'sick leave', 36: 'sick leave', 37: 'sick leave', 38: 'sick leave', 39: 'sick leave', 40: 'sick leave', 41: 'sick leave', 42: 'sick leave', 43: 'sick leave', 44: 'sick leave', 45: 'sick leave', 46: 'sick leave', 47: 'sick leave', 48: 'sick leave', 49: 'sick leave', 50: 'sick leave'}}
Answer 1:
If I understand correctly, you can build a daily date range for each row and explode it:
df_out = (df.set_index(['name','holiday_type'])
.apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)
.explode().rename('date').reset_index())
Output:
name holiday_type date
0 Khan holiday 2020-01-01
1 Khan holiday 2020-01-02
2 Khan holiday 2020-01-03
3 Khan holiday 2020-02-04
4 Khan holiday 2020-02-05
.. ... ... ...
76 Dean sick leave 2020-12-27
77 Dean sick leave 2020-12-28
78 Dean sick leave 2020-12-29
79 Dean sick leave 2020-12-30
80 Dean sick leave 2020-12-31
[81 rows x 3 columns]
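For anyone who wants to run the one-liner above end to end, here is a minimal self-contained sketch. It rebuilds the question's sample frame, applies the same `date_range` plus `explode` chain, and compares the row count against the number of calendar days in each period. The `expected_rows` name is introduced here only for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'name': ['Khan', 'Khan', 'Khan', 'Dean', 'Dean', 'Dean'],
    'start_date': ['01-01-2020', '04-02-2020', '02-03-2020',
                   '09-04-2020', '06-08-2020', '12-12-2020'],
    'end_date': ['03-01-2020', '09-02-2020', '02-03-2020',
                 '15-05-2020', '19-08-2020', '31-12-2020'],
    'holiday_type': ['holiday', 'holiday', 'sick leave',
                     'holiday', 'holiday', 'sick leave'],
})
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(
    pd.to_datetime, format='%d-%m-%Y')

# One DatetimeIndex per original row, then one output row per day
df_out = (df.set_index(['name', 'holiday_type'])
            .apply(lambda row: pd.date_range(row['start_date'], row['end_date']),
                   axis=1)
            .explode()
            .rename('date')
            .reset_index())

# Hypothetical sanity check: inclusive day count summed over all periods
expected_rows = int(((df['end_date'] - df['start_date']).dt.days + 1).sum())
print(len(df_out) == expected_rows)
```

Because every period is expanded independently, two periods of the same person never get merged into one continuous range.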
Dictionary output:
df_out.to_dict()
Output:
{'name': {0: 'Khan',
1: 'Khan',
2: 'Khan',
3: 'Khan',
4: 'Khan',
5: 'Khan',
6: 'Khan',
7: 'Khan',
8: 'Khan',
9: 'Khan',
10: 'Dean',
11: 'Dean',
12: 'Dean',
13: 'Dean',
14: 'Dean',
15: 'Dean',
16: 'Dean',
17: 'Dean',
18: 'Dean',
19: 'Dean',
20: 'Dean',
21: 'Dean',
22: 'Dean',
23: 'Dean',
24: 'Dean',
25: 'Dean',
26: 'Dean',
27: 'Dean',
28: 'Dean',
29: 'Dean',
30: 'Dean',
31: 'Dean',
32: 'Dean',
33: 'Dean',
34: 'Dean',
35: 'Dean',
36: 'Dean',
37: 'Dean',
38: 'Dean',
39: 'Dean',
40: 'Dean',
41: 'Dean',
42: 'Dean',
43: 'Dean',
44: 'Dean',
45: 'Dean',
46: 'Dean',
47: 'Dean',
48: 'Dean',
49: 'Dean',
50: 'Dean',
51: 'Dean',
52: 'Dean',
53: 'Dean',
54: 'Dean',
55: 'Dean',
56: 'Dean',
57: 'Dean',
58: 'Dean',
59: 'Dean',
60: 'Dean',
61: 'Dean',
62: 'Dean',
63: 'Dean',
64: 'Dean',
65: 'Dean',
66: 'Dean',
67: 'Dean',
68: 'Dean',
69: 'Dean',
70: 'Dean',
71: 'Dean',
72: 'Dean',
73: 'Dean',
74: 'Dean',
75: 'Dean',
76: 'Dean',
77: 'Dean',
78: 'Dean',
79: 'Dean',
80: 'Dean'},
'holiday_type': {0: 'holiday',
1: 'holiday',
2: 'holiday',
3: 'holiday',
4: 'holiday',
5: 'holiday',
6: 'holiday',
7: 'holiday',
8: 'holiday',
9: 'sick leave',
10: 'holiday',
11: 'holiday',
12: 'holiday',
13: 'holiday',
14: 'holiday',
15: 'holiday',
16: 'holiday',
17: 'holiday',
18: 'holiday',
19: 'holiday',
20: 'holiday',
21: 'holiday',
22: 'holiday',
23: 'holiday',
24: 'holiday',
25: 'holiday',
26: 'holiday',
27: 'holiday',
28: 'holiday',
29: 'holiday',
30: 'holiday',
31: 'holiday',
32: 'holiday',
33: 'holiday',
34: 'holiday',
35: 'holiday',
36: 'holiday',
37: 'holiday',
38: 'holiday',
39: 'holiday',
40: 'holiday',
41: 'holiday',
42: 'holiday',
43: 'holiday',
44: 'holiday',
45: 'holiday',
46: 'holiday',
47: 'holiday',
48: 'holiday',
49: 'holiday',
50: 'holiday',
51: 'holiday',
52: 'holiday',
53: 'holiday',
54: 'holiday',
55: 'holiday',
56: 'holiday',
57: 'holiday',
58: 'holiday',
59: 'holiday',
60: 'holiday',
61: 'sick leave',
62: 'sick leave',
63: 'sick leave',
64: 'sick leave',
65: 'sick leave',
66: 'sick leave',
67: 'sick leave',
68: 'sick leave',
69: 'sick leave',
70: 'sick leave',
71: 'sick leave',
72: 'sick leave',
73: 'sick leave',
74: 'sick leave',
75: 'sick leave',
76: 'sick leave',
77: 'sick leave',
78: 'sick leave',
79: 'sick leave',
80: 'sick leave'},
'date': {0: Timestamp('2020-01-01 00:00:00'),
1: Timestamp('2020-01-02 00:00:00'),
2: Timestamp('2020-01-03 00:00:00'),
3: Timestamp('2020-02-04 00:00:00'),
4: Timestamp('2020-02-05 00:00:00'),
5: Timestamp('2020-02-06 00:00:00'),
6: Timestamp('2020-02-07 00:00:00'),
7: Timestamp('2020-02-08 00:00:00'),
8: Timestamp('2020-02-09 00:00:00'),
9: Timestamp('2020-03-02 00:00:00'),
10: Timestamp('2020-04-09 00:00:00'),
11: Timestamp('2020-04-10 00:00:00'),
12: Timestamp('2020-04-11 00:00:00'),
13: Timestamp('2020-04-12 00:00:00'),
14: Timestamp('2020-04-13 00:00:00'),
15: Timestamp('2020-04-14 00:00:00'),
16: Timestamp('2020-04-15 00:00:00'),
17: Timestamp('2020-04-16 00:00:00'),
18: Timestamp('2020-04-17 00:00:00'),
19: Timestamp('2020-04-18 00:00:00'),
20: Timestamp('2020-04-19 00:00:00'),
21: Timestamp('2020-04-20 00:00:00'),
22: Timestamp('2020-04-21 00:00:00'),
23: Timestamp('2020-04-22 00:00:00'),
24: Timestamp('2020-04-23 00:00:00'),
25: Timestamp('2020-04-24 00:00:00'),
26: Timestamp('2020-04-25 00:00:00'),
27: Timestamp('2020-04-26 00:00:00'),
28: Timestamp('2020-04-27 00:00:00'),
29: Timestamp('2020-04-28 00:00:00'),
30: Timestamp('2020-04-29 00:00:00'),
31: Timestamp('2020-04-30 00:00:00'),
32: Timestamp('2020-05-01 00:00:00'),
33: Timestamp('2020-05-02 00:00:00'),
34: Timestamp('2020-05-03 00:00:00'),
35: Timestamp('2020-05-04 00:00:00'),
36: Timestamp('2020-05-05 00:00:00'),
37: Timestamp('2020-05-06 00:00:00'),
38: Timestamp('2020-05-07 00:00:00'),
39: Timestamp('2020-05-08 00:00:00'),
40: Timestamp('2020-05-09 00:00:00'),
41: Timestamp('2020-05-10 00:00:00'),
42: Timestamp('2020-05-11 00:00:00'),
43: Timestamp('2020-05-12 00:00:00'),
44: Timestamp('2020-05-13 00:00:00'),
45: Timestamp('2020-05-14 00:00:00'),
46: Timestamp('2020-05-15 00:00:00'),
47: Timestamp('2020-08-06 00:00:00'),
48: Timestamp('2020-08-07 00:00:00'),
49: Timestamp('2020-08-08 00:00:00'),
50: Timestamp('2020-08-09 00:00:00'),
51: Timestamp('2020-08-10 00:00:00'),
52: Timestamp('2020-08-11 00:00:00'),
53: Timestamp('2020-08-12 00:00:00'),
54: Timestamp('2020-08-13 00:00:00'),
55: Timestamp('2020-08-14 00:00:00'),
56: Timestamp('2020-08-15 00:00:00'),
57: Timestamp('2020-08-16 00:00:00'),
58: Timestamp('2020-08-17 00:00:00'),
59: Timestamp('2020-08-18 00:00:00'),
60: Timestamp('2020-08-19 00:00:00'),
61: Timestamp('2020-12-12 00:00:00'),
62: Timestamp('2020-12-13 00:00:00'),
63: Timestamp('2020-12-14 00:00:00'),
64: Timestamp('2020-12-15 00:00:00'),
65: Timestamp('2020-12-16 00:00:00'),
66: Timestamp('2020-12-17 00:00:00'),
67: Timestamp('2020-12-18 00:00:00'),
68: Timestamp('2020-12-19 00:00:00'),
69: Timestamp('2020-12-20 00:00:00'),
70: Timestamp('2020-12-21 00:00:00'),
71: Timestamp('2020-12-22 00:00:00'),
72: Timestamp('2020-12-23 00:00:00'),
73: Timestamp('2020-12-24 00:00:00'),
74: Timestamp('2020-12-25 00:00:00'),
75: Timestamp('2020-12-26 00:00:00'),
76: Timestamp('2020-12-27 00:00:00'),
77: Timestamp('2020-12-28 00:00:00'),
78: Timestamp('2020-12-29 00:00:00'),
79: Timestamp('2020-12-30 00:00:00'),
80: Timestamp('2020-12-31 00:00:00')}}
Answer 2:
Similar to @Scott Boston's answer, but with groupby.resample:
(df.set_index(['name','holiday_type'], append=True).stack()
   .reset_index(name='date_range')
   .set_index('date_range')
   .groupby('level_0')
   # select the columns with a list; the tuple form ['name','holiday_type'] is rejected by recent pandas
   .resample('D')[['name','holiday_type']].ffill()
   .reset_index()
   [['name', 'date_range', 'holiday_type']]
)
name date_range holiday_type
0 Khan 2020-01-01 holiday
1 Khan 2020-01-02 holiday
2 Khan 2020-01-03 holiday
3 Khan 2020-02-04 holiday
4 Khan 2020-02-05 holiday
5 Khan 2020-02-06 holiday
6 Khan 2020-02-07 holiday
7 Khan 2020-02-08 holiday
8 Khan 2020-02-09 holiday
9 Khan 2020-03-02 sick leave
10 Dean 2020-04-09 holiday
11 Dean 2020-04-10 holiday
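Answer 3:
An alternative solution using pd.Series.map:
# set_index returns a new frame, so keep the result
df1 = df.set_index(['name','holiday_type'])
# store each (start, end) pair as a tuple in a single column
df1['date_range'] = list(zip(df1['start_date'], df1['end_date']))
df1['date_range'].map(lambda x: pd.date_range(*x)).explode().reset_index()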
name holiday_type date_range
0 Khan holiday 2020-01-01
1 Khan holiday 2020-01-02
2 Khan holiday 2020-01-03
3 Khan holiday 2020-02-04
4 Khan holiday 2020-02-05
.. ... ... ...
76 Dean sick leave 2020-12-27
77 Dean sick leave 2020-12-28
78 Dean sick leave 2020-12-29
79 Dean sick leave 2020-12-30
80 Dean sick leave 2020-12-31
[81 rows x 3 columns]
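The same idea can also be written with a list comprehension and `DataFrame.explode`, which avoids setting an index first. This is only a sketch: it assumes pandas 0.25 or later (where `DataFrame.explode` is available), and the `expanded` name is made up for the example.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'name': ['Khan', 'Khan', 'Khan', 'Dean', 'Dean', 'Dean'],
    'start_date': ['01-01-2020', '04-02-2020', '02-03-2020',
                   '09-04-2020', '06-08-2020', '12-12-2020'],
    'end_date': ['03-01-2020', '09-02-2020', '02-03-2020',
                 '15-05-2020', '19-08-2020', '31-12-2020'],
    'holiday_type': ['holiday', 'holiday', 'sick leave',
                     'holiday', 'holiday', 'sick leave'],
})
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(
    pd.to_datetime, format='%d-%m-%Y')

# Build one daily DatetimeIndex per row, then explode to one row per day
expanded = (
    df.assign(date_range=[pd.date_range(s, e)
                          for s, e in zip(df['start_date'], df['end_date'])])
      .explode('date_range')
      .drop(columns=['start_date', 'end_date'])
      .reset_index(drop=True)
)
# explode leaves an object column, so convert it back to datetime64
expanded['date_range'] = pd.to_datetime(expanded['date_range'])

print(expanded.groupby('name').size())
```

The final `pd.to_datetime` call matters if you later want to use the `.dt` accessor or resample on the exploded column.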
Answer 4:
Another solution uses index.repeat and a list comprehension:
import numpy as np

df_final = df.loc[df.index.repeat((df.end_date - df.start_date).dt.days+1),
                  ['name', 'holiday_type']]
df_final['d_range'] = np.concatenate([pd.date_range(*x) for x in zip(df.start_date, df.end_date)])
df_final
Output:
name holiday_type d_range
0 Khan holiday 2020-01-01
0 Khan holiday 2020-01-02
0 Khan holiday 2020-01-03
1 Khan holiday 2020-02-04
1 Khan holiday 2020-02-05
.. ... ... ...
5 Dean sick leave 2020-12-27
5 Dean sick leave 2020-12-28
5 Dean sick leave 2020-12-29
5 Dean sick leave 2020-12-30
5 Dean sick leave 2020-12-31
[81 rows x 3 columns]
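If the per-row `pd.date_range` calls become a bottleneck on a large table, the repeat idea can be pushed a bit further by computing the dates as `start_date` plus a day offset. The sketch below is a variation on the answer above rather than the answerer's own code; `n_days`, `offset` and `out` are names introduced for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'name': ['Khan', 'Khan', 'Khan', 'Dean', 'Dean', 'Dean'],
    'start_date': ['01-01-2020', '04-02-2020', '02-03-2020',
                   '09-04-2020', '06-08-2020', '12-12-2020'],
    'end_date': ['03-01-2020', '09-02-2020', '02-03-2020',
                 '15-05-2020', '19-08-2020', '31-12-2020'],
    'holiday_type': ['holiday', 'holiday', 'sick leave',
                     'holiday', 'holiday', 'sick leave'],
})
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(
    pd.to_datetime, format='%d-%m-%Y')

# Inclusive number of calendar days in each period
n_days = ((df['end_date'] - df['start_date']).dt.days + 1).to_numpy()

# Repeat every original row once per day of its period
out = df.loc[df.index.repeat(n_days), ['name', 'holiday_type', 'start_date']]

# 0, 1, 2, ... restarting for each original row (rows share the repeated index label)
offset = out.groupby(level=0).cumcount().to_numpy()

# Date of each output row = period start + offset in days
out['d_range'] = out['start_date'] + pd.to_timedelta(offset, unit='D')
out = out.drop(columns='start_date').reset_index(drop=True)

print(out.head())
```

Whether this is actually faster depends on the number and length of the periods, so it is worth timing both versions on real data.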
Source: https://stackoverflow.com/questions/62178394/resample-a-start-end-employee-holiday-table-correctly