I have a pandas dataframe with several columns (words, start time, stop time, speaker). I want to combine all values in the 'word' column while the values in the 'speaker' c
We'll use GroupBy.agg with a dict of aggfuncs:
    (df.groupby('speaker', as_index=False, sort=False)
       .agg({'word': ' '.join, 'start': 'min', 'stop': 'max'}))

       speaker                word  start  stop
    0        2  but that's alright   2.72  3.47
    1        1       we'll have to   8.43  9.07
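One thing to keep in mind: grouping on "speaker" alone pools every row of a speaker, including runs that are not next to each other. Here is a minimal, self-contained sketch of that behaviour. The sample frame is made up for illustration (the per-word times are assumptions, not the asker's data), and `by_speaker` is just an illustrative name.

```python
import pandas as pd

# Hypothetical sample data: speaker 2 talks, then speaker 1, then speaker 2 again.
df = pd.DataFrame({
    "word":    ["but", "that's", "alright", "we'll", "have", "to", "okay", "sure"],
    "start":   [2.72, 3.03, 3.23, 8.43, 8.63, 8.95, 9.19, 10.00],
    "stop":    [3.03, 3.23, 3.47, 8.63, 8.95, 9.07, 10.00, 11.01],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2],
})

# One row per speaker: words are joined, start is the earliest, stop the latest.
# sort=False keeps speakers in order of first appearance.
by_speaker = (
    df.groupby("speaker", as_index=False, sort=False)
      .agg({"word": " ".join, "start": "min", "stop": "max"})
)
print(by_speaker)
```

With this data, both of speaker 2's runs end up in a single row, which is usually not what you want for a transcript.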
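To group by consecutive occurrences, use the shifting cumsum trick: compare each speaker with the one in the previous row, then take the cumulative sum of the changes so that every run of the same speaker gets its own label. Use that as the second grouper along with "speaker":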
    gp1 = df['speaker'].ne(df['speaker'].shift()).cumsum()
    (df.groupby(['speaker', gp1], as_index=False, sort=False)
       .agg({'word': ' '.join, 'start': 'min', 'stop': 'max'}))

       speaker                word  start   stop
    0        2  but that's alright   2.72   3.47
    1        1       we'll have to   8.43   9.07
    2        2           okay sure   9.19  11.01
    3        1               what?  11.02  12.00
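To see what the shift/cumsum key looks like, here is a self-contained sketch on made-up data (the per-word times are assumptions chosen for illustration). It is a variation on the code above, not the same call: the key is renamed to a hypothetical "block" and used as the only grouper, with "speaker" recovered via 'first'. Giving the key its own name avoids a possible clash with the existing "speaker" column, which may matter with as_index=False depending on your pandas version.

```python
import pandas as pd

# Hypothetical sample data with four consecutive runs of speakers.
df = pd.DataFrame({
    "word":    ["but", "that's", "alright", "we'll", "have", "to",
                "okay", "sure", "what?"],
    "start":   [2.72, 3.03, 3.23, 8.43, 8.63, 8.95, 9.19, 10.00, 11.02],
    "stop":    [3.03, 3.23, 3.47, 8.63, 8.95, 9.07, 10.00, 11.01, 12.00],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2, 1],
})

# True wherever the speaker differs from the previous row
# (the first row compares against NaN, so it is always True).
changed = df["speaker"].ne(df["speaker"].shift())

# Running count of changes: every consecutive run gets its own integer label.
block = changed.cumsum().rename("block")

# Each block contains exactly one speaker, so 'first' recovers it.
runs = (
    df.groupby(block, sort=False)
      .agg({"speaker": "first", "word": " ".join, "start": "min", "stop": "max"})
      .reset_index(drop=True)
)
print(block.tolist())
print(runs)
```

Because a block never spans more than one speaker, grouping on the block label alone yields the same runs as grouping on both keys.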