I have a Dask dataframe that looks like this:
url     referrer   session_id   ts                    customer
url1    ref1       xxx          2017-09-15 00:00:00
The following does indeed work:
gb = df.groupby(['customer', 'url', 'ts'])
gb.apply(lambda d: pd.DataFrame({'views': len(d),
                                 'visitors': d.session_id.nunique(),
                                 'referrers': [d.referrer.tolist()]})).reset_index()
(assuming, per the SQL above, that visitors should count unique sessions)
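A quick sanity check with made-up data: the same lambda runs on a plain pandas frame exactly as it does through dask's groupby-apply (visitors is computed with nunique on the assumption that it should count unique sessions):

```python
import pandas as pd

# Invented sample data mirroring the frame above.
df = pd.DataFrame({
    'customer': ['c1', 'c1', 'c2'],
    'url': ['url1', 'url1', 'url2'],
    'ts': ['2017-09-15 00:00:00', '2017-09-15 00:00:00', '2017-09-16 00:00:00'],
    'session_id': ['xxx', 'yyy', 'xxx'],
    'referrer': ['ref1', 'ref2', 'ref1'],
})

out = (df.groupby(['customer', 'url', 'ts'])
         .apply(lambda d: pd.DataFrame({'views': [len(d)],
                                        'visitors': [d.session_id.nunique()],
                                        'referrers': [d.referrer.tolist()]}))
         .reset_index())
print(out[['views', 'visitors', 'referrers']])
```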
You may wish to define the meta of the output so that dask does not have to guess the output dtypes.
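A minimal sketch of what that meta could look like for the apply above (column names and dtypes are assumptions based on the lambda):

```python
import pandas as pd

# Hypothetical meta: an empty frame carrying only the expected output
# column names and dtypes, passed as gb.apply(..., meta=meta).
meta = pd.DataFrame({'views': pd.Series(dtype='int64'),
                     'visitors': pd.Series(dtype='int64'),
                     'referrers': pd.Series(dtype='object')})
```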
The GitHub issue that @j-bennet opened gives an additional option. Based on that issue, we implemented the aggregation as follows:
import itertools
import dask.dataframe as dd

custom_agg = dd.Aggregation(
    'custom_agg',
    # chunk: within each partition, collect each group's values into a set
    lambda s: s.apply(set),
    # agg: merge the per-partition sets into one deduplicated list
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)
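The two lambdas can be sanity-checked in plain pandas: the first (the chunk step) reduces each group within a partition to a set, and the second (the agg step) merges the per-partition sets into one deduplicated list. The two partitions below are invented:

```python
import itertools
import pandas as pd

chunk = lambda s: s.apply(set)
agg = lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks))))

# Two hypothetical partitions holding values for the same group 'a'.
part1 = pd.Series(['x', 'y'], index=['a', 'a']).groupby(level=0)
part2 = pd.Series(['y', 'z'], index=['a', 'a']).groupby(level=0)

sets1 = chunk(part1)   # group 'a' -> {'x', 'y'}
sets2 = chunk(part2)   # group 'a' -> {'y', 'z'}

# dask concatenates the per-partition results and re-groups them
# before handing them to the agg step.
combined = pd.concat([sets1, sets2]).groupby(level=0)
result = agg(combined)
print(result['a'])  # deduplicated union of both partitions
```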
To combine this with the count, apply the aggregation first and then attach the group sizes (the groupby object itself has no assign method):
dfgp = df.groupby(['ID1', 'ID2'])
df2 = dfgp.agg(custom_agg).assign(cnt=dfgp.size()).reset_index()
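For reference, a pandas equivalent on invented ID1/ID2 data shows the shape of the combined result (a deduplicated column plus a per-group count) that the dask pipeline should reproduce; agg(set) stands in for custom_agg here:

```python
import pandas as pd

df = pd.DataFrame({'ID1': [1, 1, 2],
                   'ID2': ['a', 'a', 'b'],
                   'val': ['x', 'y', 'x']})

grouped = df.groupby(['ID1', 'ID2'])
# set aggregation stands in for custom_agg; cnt is the group size
df2 = grouped['val'].agg(set).to_frame('val')
df2['cnt'] = grouped.size()
df2 = df2.reset_index()
print(df2)
```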