Pyspark window function with condition

广开言路 2021-02-08 23:13

Suppose I have a DataFrame of events with the time difference between each row. The main rule is that one visit is counted only if the event has been within 5 minutes of the previ

3 answers
  •  情歌与酒
    2021-02-09 00:07

    You'll need one extra window function and a groupBy to achieve this. What we want is for every row with timeDiff greater than 300 to be the end of a group, so that the next row starts a new one. Aku's solution should work, except that its indicators mark the start of a group instead of the end. To change this, take the cumulative sum up to row n-1 instead of n (n being the current row):

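    from pyspark.sql import Window
    from pyspark.sql import functions as func

    # Running-sum window: each user's events in time order
    w = Window.partitionBy("userid").orderBy("eventtime")
    # 1 where the gap exceeds 300 seconds, i.e. the row closes a group
    DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
    # Running sum minus the row's own indicator = sum up to row n-1
    DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
    # subgroup ids restart per userid; with several users, group by ("userid", "subgroup")
    DF = DF.groupBy("subgroup").agg(
        func.min("eventtime").alias("start_time"),
        func.max("eventtime").alias("end_time"),
        func.count("*").alias("events")
    )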
    
    +--------+-------------------+-------------------+------+
    |subgroup|         start_time|           end_time|events|
    +--------+-------------------+-------------------+------+
    |       0|2017-06-04 03:00:00|2017-06-04 03:07:00|     6|
    |       1|2017-06-04 03:14:00|2017-06-04 03:15:00|     2|
    |       2|2017-06-04 03:34:00|2017-06-04 03:34:00|     1|
    |       3|2017-06-04 03:53:00|2017-06-04 03:53:00|     1|
    +--------+-------------------+-------------------+------+
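
    To see why subtracting the row's own indicator gives a sum up to n-1, the arithmetic can be traced in plain Python without Spark. This is only a sketch: the function name assign_subgroups and the time_diffs values are hypothetical, chosen so that the group sizes come out as 6, 2, 1 and 1 like the table above.

```python
from collections import Counter
from itertools import accumulate


def assign_subgroups(time_diffs, threshold=300):
    """Return one subgroup id per row for a single user's time-ordered events.

    Hypothetical helper mirroring the window logic: the id is the running
    sum of the indicator up to row n-1.
    """
    # 1 where the gap exceeds the threshold, i.e. the row closes a group
    indicators = [int(d > threshold) for d in time_diffs]
    # Inclusive running sum (up to row n), like sum("indicator").over(w)
    running = list(accumulate(indicators))
    # Subtracting the row's own indicator turns it into a sum up to n-1,
    # so a row with a large gap stays in the group it closes.
    return [total - ind for total, ind in zip(running, indicators)]


# Assumed sample values in seconds, not taken from the question's data
time_diffs = [60, 120, 60, 60, 120, 420, 60, 1140, 1140, 1140]
subgroups = assign_subgroups(time_diffs)
events_per_group = Counter(subgroups)

print(subgroups)
print(dict(events_per_group))
```

    Without the subtraction, the row with the 420-second gap would get id 1 and open the second group instead of closing the first.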
    

    It seems that you also want to filter out groups with only one event, hence:

    DF = DF.filter("events != 1")
    
    +--------+-------------------+-------------------+------+
    |subgroup|         start_time|           end_time|events|
    +--------+-------------------+-------------------+------+
    |       0|2017-06-04 03:00:00|2017-06-04 03:07:00|     6|
    |       1|2017-06-04 03:14:00|2017-06-04 03:15:00|     2|
    +--------+-------------------+-------------------+------+
    
