PySpark: Custom window function

时光取名叫无心 · 2021-02-10 08:12

I am currently trying to extract series of consecutive occurrences in a PySpark dataframe and order/rank them as shown below (for convenience, I have ordered the initial dataframe).

2 Answers
栀梦 · 2021-02-10 09:04

    I'm afraid this is not possible with the standard DataFrame window functions. But you can still use the old RDD API's groupByKey() to achieve that transformation:

    >>> from itertools import groupby
    >>> 
    >>> def recalculate(records):
    ...     # records is (user_id, iterable of rows) as produced by groupByKey()
    ...     # sort this user's rows by timestamp and keep only the action values
    ...     actions = [r.actions for r in sorted(records[1], key=lambda r: r.timestamp)]
    ...     # collapse consecutive identical actions into runs
    ...     groups = [list(g) for k, g in groupby(actions)]
    ...     # emit (user_id, action, length of the run, 1-based rank of the run)
    ...     return [(records[0], g[0], len(g), i + 1) for i, g in enumerate(groups)]
    ... 
    >>> df_ini.rdd.map(lambda row: (row.user_id, row)) \
    ...     .groupByKey().flatMap(recalculate) \
    ...     .toDF(['user_id', 'actions', 'nf_of_occ', 'order']).show()
    +-------+-------+---------+-----+
    |user_id|actions|nf_of_occ|order|
    +-------+-------+---------+-----+
    | 217498|      A|        3|    1|
    | 217498|      B|        1|    2|
    | 217498|      C|        2|    3|
    | 217498|      A|        1|    4|
    | 217498|      B|        2|    5|
    | 854123|      A|        2|    1|
    +-------+-------+---------+-----+
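
    For reference, the question's original dataframe is not shown here, but an input that reproduces this output can be reconstructed from it. A minimal sketch, assuming the columns are named user_id, timestamp and actions, with plain integers standing in for real timestamps:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # one row per event; the integer timestamps only fix the ordering
        rows = [
            (217498, 1, 'A'), (217498, 2, 'A'), (217498, 3, 'A'),
            (217498, 4, 'B'),
            (217498, 5, 'C'), (217498, 6, 'C'),
            (217498, 7, 'A'),
            (217498, 8, 'B'), (217498, 9, 'B'),
            (854123, 1, 'A'), (854123, 2, 'A'),
        ]
        df_ini = spark.createDataFrame(rows, ['user_id', 'timestamp', 'actions'])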
    

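    On the "not possible with window functions" point: no single built-in window function does this directly, but the runs themselves can be labelled in the DataFrame API by combining lag() with a running sum (the usual gaps-and-islands trick). A rough sketch, assuming the same df_ini and column names as above:

        from pyspark.sql import functions as F, Window

        w = Window.partitionBy('user_id').orderBy('timestamp')
        # 1 where the action differs from the previous row, else 0;
        # the first row of each user has no previous row, so default to 1
        changed = (F.col('actions') != F.lag('actions').over(w)).cast('int')
        result = (
            df_ini
            .withColumn('changed', F.coalesce(changed, F.lit(1)))
            # a running sum of the change flags numbers each consecutive run
            .withColumn('order', F.sum('changed').over(w))
            .groupBy('user_id', 'actions', 'order')
            .agg(F.count('*').alias('nf_of_occ'))
            .select('user_id', 'actions', 'nf_of_occ', 'order')
            .orderBy('user_id', 'order')
        )
        result.show()

    This should produce the same table as the RDD version above.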