resample a start & end employee holiday table correctly

梦想与她 submitted on 2021-02-19 02:26:09

Question


I have the following dataframe.

import pandas as pd

df = pd.DataFrame(
    {'name' : ['Khan','Khan','Khan','Dean','Dean','Dean'],
     'start_date' : ['01-01-2020','04-02-2020','02-03-2020','09-04-2020','06-08-2020','12-12-2020'],
     'end_date' : ['03-01-2020', '09-02-2020','02-03-2020','15-05-2020','19-08-2020','31-12-2020'],
     'holiday_type' : ['holiday','holiday','sick leave','holiday','holiday','sick leave']
    } )

# parse both columns as day-month-year
df[['start_date','end_date']] = df[['start_date','end_date']].apply(pd.to_datetime,format='%d-%m-%Y')

print(df)

   name start_date   end_date holiday_type
0  Khan 2020-01-01 2020-01-03      holiday
1  Khan 2020-02-04 2020-02-09      holiday
2  Khan 2020-03-02 2020-03-02   sick leave
3  Dean 2020-04-09 2020-05-15      holiday
4  Dean 2020-08-06 2020-08-19      holiday
5  Dean 2020-12-12 2020-12-31   sick leave
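
The explicit format='%d-%m-%Y' in the call above matters for these strings. As a small sketch (the variable names raw, day_first and month_first are illustrative, not from the original post), the same string resolves to a different date when pandas is left to its month-first default:

```python
import pandas as pd

raw = '04-02-2020'  # Khan's second start_date, as typed in the question

# Explicit day-month-year format, as in the question: 4 February 2020.
day_first = pd.to_datetime(raw, format='%d-%m-%Y')

# No format given: an ambiguous string is read month-first, i.e. 2 April 2020.
month_first = pd.to_datetime(raw)

print(day_first.date(), month_first.date())
```

Passing dayfirst=True is another way to get the day-first reading, but an explicit format is stricter because a string that does not match raises an error.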

I'm attempting to resample the data by start and end date without overlapping, i.e. I don't want a single table for Khan that starts on 2020-01-02 and ends on 2020-03-02.

My own attempt has been to melt, index and use groupby.resample; however, I'm unsure how to group by each start and end date only (a cumulative count?), and it doesn't seem very efficient either.

Ideal output:

    name date_range holiday_type
0   Khan 2020-01-01      holiday
1   Khan 2020-02-01      holiday
2   Khan 2020-03-01      holiday # < end of holiday 1
3   Khan 2020-04-02      holiday 
4   Khan 2020-05-02      holiday
5   Khan 2020-06-02      holiday
6   Khan 2020-07-02      holiday
7   Khan 2020-08-02      holiday
8   Khan 2020-09-02      holiday # end of holiday 2
9   Khan 2020-02-03   sick leave # one day sick leave, can also have one day holiday.
10  Dean 2020-09-04      holiday
11  Dean 2020-10-04      holiday
12  Dean 2020-11-04      holiday
13  Dean 2020-12-04      holiday
14  Dean 2020-04-13      holiday
15  Dean 2020-04-14      holiday
16  Dean 2020-04-15      holiday
17  Dean 2020-06-08      holiday
18  Dean 2020-07-08      holiday
19  Dean 2020-08-08      holiday
20  Dean 2020-09-08      holiday
21  Dean 2020-10-08      holiday
22  Dean 2020-11-08      holiday
23  Dean 2020-12-08      holiday
24  Dean 2020-08-13      holiday
25  Dean 2020-08-14      holiday
26  Dean 2020-08-15      holiday
27  Dean 2020-08-16      holiday
28  Dean 2020-08-17      holiday
29  Dean 2020-08-18      holiday
30  Dean 2020-08-19      holiday
31  Dean 2020-12-12   sick leave
32  Dean 2020-12-13   sick leave
33  Dean 2020-12-14   sick leave
34  Dean 2020-12-15   sick leave
35  Dean 2020-12-16   sick leave
36  Dean 2020-12-17   sick leave
37  Dean 2020-12-18   sick leave
38  Dean 2020-12-19   sick leave
39  Dean 2020-12-20   sick leave
40  Dean 2020-12-21   sick leave
41  Dean 2020-12-22   sick leave
42  Dean 2020-12-23   sick leave
43  Dean 2020-12-24   sick leave
44  Dean 2020-12-25   sick leave
45  Dean 2020-12-26   sick leave
46  Dean 2020-12-27   sick leave
47  Dean 2020-12-28   sick leave
48  Dean 2020-12-29   sick leave
49  Dean 2020-12-30   sick leave
50  Dean 2020-12-31   sick leave

Ideal output as dict:

{'name': {0: 'Khan', 1: 'Khan', 2: 'Khan', 3: 'Khan', 4: 'Khan', 5: 'Khan', 6: 'Khan', 7: 'Khan', 8: 'Khan', 9: 'Khan', 10: 'Dean', 11: 'Dean', 12: 'Dean', 13: 'Dean', 14: 'Dean', 15: 'Dean', 16: 'Dean', 17: 'Dean', 18: 'Dean', 19: 'Dean', 20: 'Dean', 21: 'Dean', 22: 'Dean', 23: 'Dean', 24: 'Dean', 25: 'Dean', 26: 'Dean', 27: 'Dean', 28: 'Dean', 29: 'Dean', 30: 'Dean', 31: 'Dean', 32: 'Dean', 33: 'Dean', 34: 'Dean', 35: 'Dean', 36: 'Dean', 37: 'Dean', 38: 'Dean', 39: 'Dean', 40: 'Dean', 41: 'Dean', 42: 'Dean', 43: 'Dean', 44: 'Dean', 45: 'Dean', 46: 'Dean', 47: 'Dean', 48: 'Dean', 49: 'Dean', 50: 'Dean'}, 'date_range': {0: Timestamp('2020-01-01 00:00:00'), 1: Timestamp('2020-02-01 00:00:00'), 2: Timestamp('2020-03-01 00:00:00'), 3: Timestamp('2020-04-02 00:00:00'), 4: Timestamp('2020-05-02 00:00:00'), 5: Timestamp('2020-06-02 00:00:00'), 6: Timestamp('2020-07-02 00:00:00'), 7: Timestamp('2020-08-02 00:00:00'), 8: Timestamp('2020-09-02 00:00:00'), 9: Timestamp('2020-02-03 00:00:00'), 10: Timestamp('2020-09-04 00:00:00'), 11: Timestamp('2020-10-04 00:00:00'), 12: Timestamp('2020-11-04 00:00:00'), 13: Timestamp('2020-12-04 00:00:00'), 14: Timestamp('2020-04-13 00:00:00'), 15: Timestamp('2020-04-14 00:00:00'), 16: Timestamp('2020-04-15 00:00:00'), 17: Timestamp('2020-06-08 00:00:00'), 18: Timestamp('2020-07-08 00:00:00'), 19: Timestamp('2020-08-08 00:00:00'), 20: Timestamp('2020-09-08 00:00:00'), 21: Timestamp('2020-10-08 00:00:00'), 22: Timestamp('2020-11-08 00:00:00'), 23: Timestamp('2020-12-08 00:00:00'), 24: Timestamp('2020-08-13 00:00:00'), 25: Timestamp('2020-08-14 00:00:00'), 26: Timestamp('2020-08-15 00:00:00'), 27: Timestamp('2020-08-16 00:00:00'), 28: Timestamp('2020-08-17 00:00:00'), 29: Timestamp('2020-08-18 00:00:00'), 30: Timestamp('2020-08-19 00:00:00'), 31: Timestamp('2020-12-12 00:00:00'), 32: Timestamp('2020-12-13 00:00:00'), 33: Timestamp('2020-12-14 00:00:00'), 34: Timestamp('2020-12-15 00:00:00'), 35: Timestamp('2020-12-16 00:00:00'), 36: 
Timestamp('2020-12-17 00:00:00'), 37: Timestamp('2020-12-18 00:00:00'), 38: Timestamp('2020-12-19 00:00:00'), 39: Timestamp('2020-12-20 00:00:00'), 40: Timestamp('2020-12-21 00:00:00'), 41: Timestamp('2020-12-22 00:00:00'), 42: Timestamp('2020-12-23 00:00:00'), 43: Timestamp('2020-12-24 00:00:00'), 44: Timestamp('2020-12-25 00:00:00'), 45: Timestamp('2020-12-26 00:00:00'), 46: Timestamp('2020-12-27 00:00:00'), 47: Timestamp('2020-12-28 00:00:00'), 48: Timestamp('2020-12-29 00:00:00'), 49: Timestamp('2020-12-30 00:00:00'), 50: Timestamp('2020-12-31 00:00:00')}, 'holiday_type': {0: 'holiday', 1: 'holiday', 2: 'holiday', 3: 'holiday', 4: 'holiday', 5: 'holiday', 6: 'holiday', 7: 'holiday', 8: 'holiday', 9: 'sick leave', 10: 'holiday', 11: 'holiday', 12: 'holiday', 13: 'holiday', 14: 'holiday', 15: 'holiday', 16: 'holiday', 17: 'holiday', 18: 'holiday', 19: 'holiday', 20: 'holiday', 21: 'holiday', 22: 'holiday', 23: 'holiday', 24: 'holiday', 25: 'holiday', 26: 'holiday', 27: 'holiday', 28: 'holiday', 29: 'holiday', 30: 'holiday', 31: 'sick leave', 32: 'sick leave', 33: 'sick leave', 34: 'sick leave', 35: 'sick leave', 36: 'sick leave', 37: 'sick leave', 38: 'sick leave', 39: 'sick leave', 40: 'sick leave', 41: 'sick leave', 42: 'sick leave', 43: 'sick leave', 44: 'sick leave', 45: 'sick leave', 46: 'sick leave', 47: 'sick leave', 48: 'sick leave', 49: 'sick leave', 50: 'sick leave'}}

Answer 1:


If I understand correctly,

df_out = (df.set_index(['name','holiday_type'])
            .apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)
            .explode().rename('date').reset_index())

Output:

    name holiday_type       date
0   Khan      holiday 2020-01-01
1   Khan      holiday 2020-01-02
2   Khan      holiday 2020-01-03
3   Khan      holiday 2020-02-04
4   Khan      holiday 2020-02-05
..   ...          ...        ...
76  Dean   sick leave 2020-12-27
77  Dean   sick leave 2020-12-28
78  Dean   sick leave 2020-12-29
79  Dean   sick leave 2020-12-30
80  Dean   sick leave 2020-12-31

[81 rows x 3 columns]

Dictionary output:

df_out.to_dict()

Output:

{'name': {0: 'Khan',
  1: 'Khan',
  2: 'Khan',
  3: 'Khan',
  4: 'Khan',
  5: 'Khan',
  6: 'Khan',
  7: 'Khan',
  8: 'Khan',
  9: 'Khan',
  10: 'Dean',
  11: 'Dean',
  12: 'Dean',
  13: 'Dean',
  14: 'Dean',
  15: 'Dean',
  16: 'Dean',
  17: 'Dean',
  18: 'Dean',
  19: 'Dean',
  20: 'Dean',
  21: 'Dean',
  22: 'Dean',
  23: 'Dean',
  24: 'Dean',
  25: 'Dean',
  26: 'Dean',
  27: 'Dean',
  28: 'Dean',
  29: 'Dean',
  30: 'Dean',
  31: 'Dean',
  32: 'Dean',
  33: 'Dean',
  34: 'Dean',
  35: 'Dean',
  36: 'Dean',
  37: 'Dean',
  38: 'Dean',
  39: 'Dean',
  40: 'Dean',
  41: 'Dean',
  42: 'Dean',
  43: 'Dean',
  44: 'Dean',
  45: 'Dean',
  46: 'Dean',
  47: 'Dean',
  48: 'Dean',
  49: 'Dean',
  50: 'Dean',
  51: 'Dean',
  52: 'Dean',
  53: 'Dean',
  54: 'Dean',
  55: 'Dean',
  56: 'Dean',
  57: 'Dean',
  58: 'Dean',
  59: 'Dean',
  60: 'Dean',
  61: 'Dean',
  62: 'Dean',
  63: 'Dean',
  64: 'Dean',
  65: 'Dean',
  66: 'Dean',
  67: 'Dean',
  68: 'Dean',
  69: 'Dean',
  70: 'Dean',
  71: 'Dean',
  72: 'Dean',
  73: 'Dean',
  74: 'Dean',
  75: 'Dean',
  76: 'Dean',
  77: 'Dean',
  78: 'Dean',
  79: 'Dean',
  80: 'Dean'},
 'holiday_type': {0: 'holiday',
  1: 'holiday',
  2: 'holiday',
  3: 'holiday',
  4: 'holiday',
  5: 'holiday',
  6: 'holiday',
  7: 'holiday',
  8: 'holiday',
  9: 'sick leave',
  10: 'holiday',
  11: 'holiday',
  12: 'holiday',
  13: 'holiday',
  14: 'holiday',
  15: 'holiday',
  16: 'holiday',
  17: 'holiday',
  18: 'holiday',
  19: 'holiday',
  20: 'holiday',
  21: 'holiday',
  22: 'holiday',
  23: 'holiday',
  24: 'holiday',
  25: 'holiday',
  26: 'holiday',
  27: 'holiday',
  28: 'holiday',
  29: 'holiday',
  30: 'holiday',
  31: 'holiday',
  32: 'holiday',
  33: 'holiday',
  34: 'holiday',
  35: 'holiday',
  36: 'holiday',
  37: 'holiday',
  38: 'holiday',
  39: 'holiday',
  40: 'holiday',
  41: 'holiday',
  42: 'holiday',
  43: 'holiday',
  44: 'holiday',
  45: 'holiday',
  46: 'holiday',
  47: 'holiday',
  48: 'holiday',
  49: 'holiday',
  50: 'holiday',
  51: 'holiday',
  52: 'holiday',
  53: 'holiday',
  54: 'holiday',
  55: 'holiday',
  56: 'holiday',
  57: 'holiday',
  58: 'holiday',
  59: 'holiday',
  60: 'holiday',
  61: 'sick leave',
  62: 'sick leave',
  63: 'sick leave',
  64: 'sick leave',
  65: 'sick leave',
  66: 'sick leave',
  67: 'sick leave',
  68: 'sick leave',
  69: 'sick leave',
  70: 'sick leave',
  71: 'sick leave',
  72: 'sick leave',
  73: 'sick leave',
  74: 'sick leave',
  75: 'sick leave',
  76: 'sick leave',
  77: 'sick leave',
  78: 'sick leave',
  79: 'sick leave',
  80: 'sick leave'},
 'date': {0: Timestamp('2020-01-01 00:00:00'),
  1: Timestamp('2020-01-02 00:00:00'),
  2: Timestamp('2020-01-03 00:00:00'),
  3: Timestamp('2020-02-04 00:00:00'),
  4: Timestamp('2020-02-05 00:00:00'),
  5: Timestamp('2020-02-06 00:00:00'),
  6: Timestamp('2020-02-07 00:00:00'),
  7: Timestamp('2020-02-08 00:00:00'),
  8: Timestamp('2020-02-09 00:00:00'),
  9: Timestamp('2020-03-02 00:00:00'),
  10: Timestamp('2020-04-09 00:00:00'),
  11: Timestamp('2020-04-10 00:00:00'),
  12: Timestamp('2020-04-11 00:00:00'),
  13: Timestamp('2020-04-12 00:00:00'),
  14: Timestamp('2020-04-13 00:00:00'),
  15: Timestamp('2020-04-14 00:00:00'),
  16: Timestamp('2020-04-15 00:00:00'),
  17: Timestamp('2020-04-16 00:00:00'),
  18: Timestamp('2020-04-17 00:00:00'),
  19: Timestamp('2020-04-18 00:00:00'),
  20: Timestamp('2020-04-19 00:00:00'),
  21: Timestamp('2020-04-20 00:00:00'),
  22: Timestamp('2020-04-21 00:00:00'),
  23: Timestamp('2020-04-22 00:00:00'),
  24: Timestamp('2020-04-23 00:00:00'),
  25: Timestamp('2020-04-24 00:00:00'),
  26: Timestamp('2020-04-25 00:00:00'),
  27: Timestamp('2020-04-26 00:00:00'),
  28: Timestamp('2020-04-27 00:00:00'),
  29: Timestamp('2020-04-28 00:00:00'),
  30: Timestamp('2020-04-29 00:00:00'),
  31: Timestamp('2020-04-30 00:00:00'),
  32: Timestamp('2020-05-01 00:00:00'),
  33: Timestamp('2020-05-02 00:00:00'),
  34: Timestamp('2020-05-03 00:00:00'),
  35: Timestamp('2020-05-04 00:00:00'),
  36: Timestamp('2020-05-05 00:00:00'),
  37: Timestamp('2020-05-06 00:00:00'),
  38: Timestamp('2020-05-07 00:00:00'),
  39: Timestamp('2020-05-08 00:00:00'),
  40: Timestamp('2020-05-09 00:00:00'),
  41: Timestamp('2020-05-10 00:00:00'),
  42: Timestamp('2020-05-11 00:00:00'),
  43: Timestamp('2020-05-12 00:00:00'),
  44: Timestamp('2020-05-13 00:00:00'),
  45: Timestamp('2020-05-14 00:00:00'),
  46: Timestamp('2020-05-15 00:00:00'),
  47: Timestamp('2020-08-06 00:00:00'),
  48: Timestamp('2020-08-07 00:00:00'),
  49: Timestamp('2020-08-08 00:00:00'),
  50: Timestamp('2020-08-09 00:00:00'),
  51: Timestamp('2020-08-10 00:00:00'),
  52: Timestamp('2020-08-11 00:00:00'),
  53: Timestamp('2020-08-12 00:00:00'),
  54: Timestamp('2020-08-13 00:00:00'),
  55: Timestamp('2020-08-14 00:00:00'),
  56: Timestamp('2020-08-15 00:00:00'),
  57: Timestamp('2020-08-16 00:00:00'),
  58: Timestamp('2020-08-17 00:00:00'),
  59: Timestamp('2020-08-18 00:00:00'),
  60: Timestamp('2020-08-19 00:00:00'),
  61: Timestamp('2020-12-12 00:00:00'),
  62: Timestamp('2020-12-13 00:00:00'),
  63: Timestamp('2020-12-14 00:00:00'),
  64: Timestamp('2020-12-15 00:00:00'),
  65: Timestamp('2020-12-16 00:00:00'),
  66: Timestamp('2020-12-17 00:00:00'),
  67: Timestamp('2020-12-18 00:00:00'),
  68: Timestamp('2020-12-19 00:00:00'),
  69: Timestamp('2020-12-20 00:00:00'),
  70: Timestamp('2020-12-21 00:00:00'),
  71: Timestamp('2020-12-22 00:00:00'),
  72: Timestamp('2020-12-23 00:00:00'),
  73: Timestamp('2020-12-24 00:00:00'),
  74: Timestamp('2020-12-25 00:00:00'),
  75: Timestamp('2020-12-26 00:00:00'),
  76: Timestamp('2020-12-27 00:00:00'),
  77: Timestamp('2020-12-28 00:00:00'),
  78: Timestamp('2020-12-29 00:00:00'),
  79: Timestamp('2020-12-30 00:00:00'),
  80: Timestamp('2020-12-31 00:00:00')}}
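
As a quick sanity check on the explode approach above, the following sketch rebuilds the frame from the question, expands it the same way, and counts the days per person and leave type. The name days_off is illustrative. Since explode returns an object-dtype column, the sketch also casts the result back to datetime64, which is an optional extra step and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Khan', 'Khan', 'Khan', 'Dean', 'Dean', 'Dean'],
    'start_date': ['01-01-2020', '04-02-2020', '02-03-2020',
                   '09-04-2020', '06-08-2020', '12-12-2020'],
    'end_date': ['03-01-2020', '09-02-2020', '02-03-2020',
                 '15-05-2020', '19-08-2020', '31-12-2020'],
    'holiday_type': ['holiday', 'holiday', 'sick leave',
                     'holiday', 'holiday', 'sick leave'],
})
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(
    pd.to_datetime, format='%d-%m-%Y')

# One DatetimeIndex per row, then one output row per day.
df_out = (df.set_index(['name', 'holiday_type'])
            .apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)
            .explode().rename('date').reset_index())

# explode leaves the column as object dtype; cast it back to datetime64.
df_out['date'] = pd.to_datetime(df_out['date'])

# Days of leave per person and leave type.
days_off = df_out.groupby(['name', 'holiday_type'])['date'].count()
print(days_off)
```

Because every period is expanded on its own, the gaps between two periods of the same person stay empty, which is the non-overlapping behaviour the question asks for.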



Answer 2:


Similar to @Scott Boston's answer, but with groupby.resample:

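(df.set_index(['name','holiday_type'], append=True).stack()
   .reset_index(name='date_range')
   .set_index('date_range')
   .groupby('level_0')   # level_0 is the original row number, so each leave period is resampled on its own
   .resample('D')[['name','holiday_type']].ffill()   # select the columns with a list rather than a bare tuple
   .reset_index()
   [['name', 'date_range', 'holiday_type']]
)

Output:
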
    name date_range holiday_type
0   Khan 2020-01-01      holiday
1   Khan 2020-01-02      holiday
2   Khan 2020-01-03      holiday
3   Khan 2020-02-04      holiday
4   Khan 2020-02-05      holiday
5   Khan 2020-02-06      holiday
6   Khan 2020-02-07      holiday
7   Khan 2020-02-08      holiday
8   Khan 2020-02-09      holiday
9   Khan 2020-03-02   sick leave
10  Dean 2020-04-09      holiday
11  Dean 2020-04-10      holiday
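
Grouping on level_0, the original row number, is what keeps each leave period separate in this answer. The sketch below keeps that row number as an explicit column so that consecutive periods of the same person remain distinguishable after expansion. It uses a plain comprehension instead of resample, and the names leave_id, rows, out and recovered are hypothetical, introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Khan', 'Khan', 'Khan', 'Dean', 'Dean', 'Dean'],
    'start_date': ['01-01-2020', '04-02-2020', '02-03-2020',
                   '09-04-2020', '06-08-2020', '12-12-2020'],
    'end_date': ['03-01-2020', '09-02-2020', '02-03-2020',
                 '15-05-2020', '19-08-2020', '31-12-2020'],
    'holiday_type': ['holiday', 'holiday', 'sick leave',
                     'holiday', 'holiday', 'sick leave'],
})
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(
    pd.to_datetime, format='%d-%m-%Y')

# One record per day, tagged with the position of the row it came from.
# Column order in df is name, start_date, end_date, holiday_type.
rows = [
    (leave_id, name, kind, day)
    for leave_id, (name, start, end, kind)
    in enumerate(df.itertuples(index=False, name=None))
    for day in pd.date_range(start, end)
]
out = pd.DataFrame(rows, columns=['leave_id', 'name', 'holiday_type', 'date'])

# The first and last day of each leave_id should give back the original interval.
recovered = out.groupby('leave_id')['date'].agg(['min', 'max'])
print(recovered)
```

With leave_id in place, grouping by it gives one block per leave period, whereas grouping by name alone would merge all of a person's periods.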



Answer 3:


Alternate solution using pd.Series.map.

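# set_index returns a new frame, so keep the result
df_idx = df.set_index(['name','holiday_type'])
# pair each start with its end; a 2-D .values array cannot be assigned to a single column
df_idx['date_range'] = list(zip(df_idx['start_date'], df_idx['end_date']))
df_idx['date_range'].map(lambda x: pd.date_range(*x)).explode().reset_index()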

    name holiday_type date_range
0   Khan      holiday 2020-01-01
1   Khan      holiday 2020-01-02
2   Khan      holiday 2020-01-03
3   Khan      holiday 2020-02-04
4   Khan      holiday 2020-02-05
..   ...          ...        ...
76  Dean   sick leave 2020-12-27
77  Dean   sick leave 2020-12-28
78  Dean   sick leave 2020-12-29
79  Dean   sick leave 2020-12-30
80  Dean   sick leave 2020-12-31

[81 rows x 3 columns]



Answer 4:


Another solution uses index.repeat and a list comprehension.

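import numpy as np

# repeat each row once per day of leave (end - start, inclusive)
df_final = df.loc[df.index.repeat((df.end_date - df.start_date).dt.days+1), 
                  ['name', 'holiday_type']]
# build every daily range and flatten them into one array aligned with the repeated rows
df_final['d_range'] = np.concatenate([pd.date_range(*x) for x in zip(df.start_date, df.end_date)])

Output: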
    name holiday_type    d_range
0   Khan      holiday 2020-01-01
0   Khan      holiday 2020-01-02
0   Khan      holiday 2020-01-03
1   Khan      holiday 2020-02-04
1   Khan      holiday 2020-02-05
..   ...          ...        ...
5   Dean   sick leave 2020-12-27
5   Dean   sick leave 2020-12-28
5   Dean   sick leave 2020-12-29
5   Dean   sick leave 2020-12-30
5   Dean   sick leave 2020-12-31

[81 rows x 3 columns]
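
The list comprehension in this answer can also be swapped for date arithmetic on the repeated rows. This is a minimal sketch of that variant, assuming the same df as in the question; the names n_days, offset and out are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Khan', 'Khan', 'Khan', 'Dean', 'Dean', 'Dean'],
    'start_date': ['01-01-2020', '04-02-2020', '02-03-2020',
                   '09-04-2020', '06-08-2020', '12-12-2020'],
    'end_date': ['03-01-2020', '09-02-2020', '02-03-2020',
                 '15-05-2020', '19-08-2020', '31-12-2020'],
    'holiday_type': ['holiday', 'holiday', 'sick leave',
                     'holiday', 'holiday', 'sick leave'],
})
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(
    pd.to_datetime, format='%d-%m-%Y')

# Inclusive length of every leave period in days.
n_days = (df['end_date'] - df['start_date']).dt.days + 1

# Repeat each row once per day, keeping start_date for the arithmetic below.
out = df.loc[df.index.repeat(n_days), ['name', 'holiday_type', 'start_date']]

# 0, 1, 2, ... within each original row: the day offset from start_date.
offset = out.groupby(level=0).cumcount().to_numpy()

out = out.reset_index(drop=True)
out['date'] = out['start_date'] + pd.to_timedelta(offset, unit='D')
out = out.drop(columns='start_date')
print(out)
```

This avoids creating one DatetimeIndex per row, which may help on large tables, though that has not been measured here.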


Source: https://stackoverflow.com/questions/62178394/resample-a-start-end-employee-holiday-table-correctly
