I am currently trying to extract series of consecutive occurrences in a PySpark dataframe and order/rank them as shown below (for convenience I have ordered the initial dataframe).
I'm afraid this is not straightforward with the standard dataframe windowing functions. But you can still use the old RDD API's groupByKey() to achieve that transformation:
>>> from itertools import groupby
>>>
>>> def recalculate(records):
...     # Sort this user's rows by timestamp and keep just the action values.
...     actions = [r.actions for r in sorted(records[1], key=lambda r: r.timestamp)]
...     # itertools.groupby groups only *adjacent* equal values, which is
...     # exactly the run-length grouping we need here.
...     groups = [list(g) for _, g in groupby(actions)]
...     # Emit (user_id, action, length of the run, 1-based rank of the run).
...     return [(records[0], g[0], len(g), i + 1) for i, g in enumerate(groups)]
...
>>> df_ini.rdd.map(lambda row: (row.user_id, row)) \
...     .groupByKey().flatMap(recalculate) \
...     .toDF(['user_id', 'actions', 'nf_of_occ', 'order']).show()
+-------+-------+---------+-----+
|user_id|actions|nf_of_occ|order|
+-------+-------+---------+-----+
| 217498| A| 3| 1|
| 217498| B| 1| 2|
| 217498| C| 2| 3|
| 217498| A| 1| 4|
| 217498| B| 2| 5|
| 854123| A| 2| 1|
+-------+-------+---------+-----+
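If you want to try this end to end, here is a minimal sketch of how a df_ini could be built; the timestamp values are assumptions, invented only so that sorting by them reproduces the consecutive runs shown above:

>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Hypothetical sample data: the timestamps are made up, chosen only so
>>> # that sorting by timestamp yields the runs in the expected output.
>>> df_ini = spark.createDataFrame(
...     [(217498, t, a) for t, a in enumerate('AAABCCABB', 1)]
...     + [(854123, t, 'A') for t in (1, 2)],
...     ['user_id', 'timestamp', 'actions'])

One caveat on the approach itself: groupByKey() materializes all rows for a given user_id on a single executor, so it is best suited to cases where no single user has an enormous number of events.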