I have a DataFrame where each row has two columns: date and mentions. The end result should be a DataFrame of mention counts per date, which should be easy via GroupBy if I can br
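Using MultiLabelBinarizer from sklearn (sum(level=0) was removed in pandas 2.0, so group on the index level instead; the split separator is ', ' so the labels carry no leading space):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# One 0/1 column per distinct mention, then add up the rows that share a date
pd.DataFrame(mlb.fit_transform(df['mentions'].str.split(', ')), columns=mlb.classes_, index=df.date).groupby(level=0).sum()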
Out[1745]:
            alpha  beta  delta  gamma
date
2018-01-01      2     1      0      1
2018-01-02      0     1      0      0
2018-01-03      0     0      1      0
2018-01-05      1     0      0      0
2018-01-07      1     0      0      0
2018-01-10      0     0      1      1
2018-01-11      0     0      0      1
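Borrowing Zero's resample('D') to fill in the missing days. The date column has to be a datetime dtype for this, and resample needs an aggregation such as sum() to return a DataFrame:
pd.DataFrame(mlb.fit_transform(df['mentions'].str.split(', ')), columns=mlb.classes_, index=df.date).resample('D').sum()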
If your end result is dummy columns, then use pd.Series.str.get_dummies:
df.set_index('date').mentions.str.get_dummies(', ').groupby(level=0).sum()
            alpha  beta  delta  gamma
date
2018-01-01      2     1      0      1
2018-01-02      0     1      0      0
2018-01-03      0     0      1      0
2018-01-05      1     0      0      0
2018-01-07      1     0      0      0
2018-01-10      0     0      1      1
2018-01-11      0     0      0      1
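For anyone who wants to run this end to end, here is a minimal sketch. The sample rows are an assumption: the original data is not shown, so they are made up to reproduce the output above, with mentions separated by ', ' and date already parsed as datetime.

```python
import pandas as pd

# Hypothetical sample data, invented to match the output shown above
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03",
        "2018-01-05", "2018-01-07", "2018-01-10", "2018-01-11",
    ]),
    "mentions": [
        "alpha, beta", "alpha, gamma", "beta", "delta",
        "alpha", "alpha", "delta, gamma", "gamma",
    ],
})

# One 0/1 indicator column per distinct mention, indexed by date
dummies = df.set_index("date")["mentions"].str.get_dummies(", ")

# Add up the indicator rows that share the same date
counts = dummies.groupby(level=0).sum()
print(counts)
```

groupby(level=0) groups on the date index, so the two rows for 2018-01-01 collapse into one row of counts.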
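As mentioned by @Zero, resample('D').sum() aggregates per day and also fills in the missing dates with zeros (date must be a datetime dtype):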
df.set_index('date').mentions.str.get_dummies(', ').resample('D').sum()
            alpha  beta  delta  gamma
date
2018-01-01      2     1      0      1
2018-01-02      0     1      0      0
2018-01-03      0     0      1      0
2018-01-04      0     0      0      0
2018-01-05      1     0      0      0
2018-01-06      0     0      0      0
2018-01-07      1     0      0      0
2018-01-08      0     0      0      0
2018-01-09      0     0      0      0
2018-01-10      0     0      1      1
2018-01-11      0     0      0      1
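Since the question asks about breaking the mentions apart for a GroupBy, here is an alternative sketch that is not part of the answers above: split each string into a list, use explode (pandas 0.25+) to get one row per mention, then count with pd.crosstab. The sample rows are again an assumption made up to match the output shown.

```python
import pandas as pd

# Hypothetical sample data, invented to match the output shown above
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03",
        "2018-01-05", "2018-01-07", "2018-01-10", "2018-01-11",
    ]),
    "mentions": [
        "alpha, beta", "alpha, gamma", "beta", "delta",
        "alpha", "alpha", "delta, gamma", "gamma",
    ],
})

# Long format: one row per (date, single mention)
long_df = df.assign(mention=df["mentions"].str.split(", ")).explode("mention")

# Count how often each mention occurs on each date
per_date = pd.crosstab(long_df["date"], long_df["mention"])

# Optional: insert the missing calendar days as rows of zeros
daily = per_date.resample("D").sum()
print(daily)
```

The long format is also handy if you want something other than a wide table, for example long_df.groupby(['date', 'mention']).size().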