I am currently trying to extract series of consecutive occurrences in a PySpark dataframe and order/rank them as shown below (for convenience I have ordered the initial dataframe).
I'm afraid this is not straightforward with the standard dataframe windowing functions. But you can still use the old RDD API's groupByKey() to achieve that transformation:
>>> from itertools import groupby
>>>
>>> def recalculate(records):
...     # Sort this user's rows by timestamp and keep just the action values.
...     actions = [r.actions for r in sorted(records[1], key=lambda r: r.timestamp)]
...     # itertools.groupby groups only *adjacent* equal values, which is
...     # exactly the run-length grouping we need here.
...     groups = [list(g) for _, g in groupby(actions)]
...     # Emit (user_id, action, length of the run, 1-based rank of the run).
...     return [(records[0], g[0], len(g), i + 1) for i, g in enumerate(groups)]
...
>>> df_ini.rdd.map(lambda row: (row.user_id, row)) \
...     .groupByKey().flatMap(recalculate) \
...     .toDF(['user_id', 'actions', 'nf_of_occ', 'order']).show()
+-------+-------+---------+-----+
|user_id|actions|nf_of_occ|order|
+-------+-------+---------+-----+
| 217498| A| 3| 1|
| 217498| B| 1| 2|
| 217498| C| 2| 3|
| 217498| A| 1| 4|
| 217498| B| 2| 5|
| 854123| A| 2| 1|
+-------+-------+---------+-----+
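If you want to try this end to end, here is a minimal sketch of how a df_ini could be built; the timestamp values are assumptions, invented only so that sorting by them reproduces the consecutive runs shown above:

>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Hypothetical sample data: the timestamps are made up, chosen only so
>>> # that sorting by timestamp yields the runs in the expected output.
>>> df_ini = spark.createDataFrame(
...     [(217498, t, a) for t, a in enumerate('AAABCCABB', 1)]
...     + [(854123, t, 'A') for t in (1, 2)],
...     ['user_id', 'timestamp', 'actions'])

One caveat on the approach itself: groupByKey() materializes all rows for a given user_id on a single executor, so it is best suited to cases where no single user has an enormous number of events.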