PySpark window function with condition


广开言路 2021-02-08 23:13

Suppose I have a DataFrame of events with the time difference between consecutive rows. The main rule is that an event counts toward a visit only if it occurred within 5 minutes of the previ

3 answers

臣服心动 (original poster)

2021-02-08 23:56
So if I understand this correctly, you essentially want to end each group when TimeDiff > 300? This seems relatively straightforward with a cumulative sum over a window:

First, some imports:
```
from pyspark.sql.window import Window
import pyspark.sql.functions as func
```
Then set up the window. I assumed you would partition by userid:
```
w = Window.partitionBy("userid").orderBy("eventtime")
```
Then figure out which subgroup each observation falls into: first mark the first member of each group (a row whose TimeDiff exceeds 300), then take the running sum of that marker column over the window.
```
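# 1 when the gap to the previous event exceeds 300 seconds (a new group starts), else 0
indicator = (func.col("TimeDiff") > 300).cast("integer")
# running sum of the markers within each user's ordered events gives the group number
subgroup = func.sum(indicator).over(w).alias("subgroup")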
```
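To see why summing the indicator yields a group number, here is a minimal plain-Python sketch of the same running sum for a single user's events, with no Spark involved. The sample gaps and the helper name `assign_subgroups` are made up for illustration; the 300-second threshold comes from the 5-minute rule in the question.

```python
from itertools import accumulate


def assign_subgroups(time_diffs, threshold=300):
    """Return a running count of gaps larger than `threshold`.

    `time_diffs` holds the seconds since the previous event, in event order,
    for one user (hypothetical sample data, not taken from the question).
    """
    # 1 marks an event that starts a new group, 0 one that continues the current group
    indicators = [int(diff > threshold) for diff in time_diffs]
    # The running sum only increases at a group start, so it works as a group id
    return list(accumulate(indicators))


gaps = [0, 60, 120, 400, 30, 900, 10]
print(assign_subgroups(gaps))  # [0, 0, 0, 1, 1, 2, 2]
```

A gap of exactly 300 seconds does not start a new group here, matching the strict `> 300` comparison above.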
Then apply some aggregation functions and you should be done:
```
# subgroup numbers restart for every userid, so group on both columns
DF = DF.select("*", subgroup)\
.groupBy("userid", "subgroup")\
.agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count(func.lit(1)).alias("events")
)
```
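One thing worth checking in the aggregation is the grouping key: because the running sum is computed per userid, the same subgroup number appears for different users. The plain-Python sketch below mirrors the min/max/count aggregation and keys it on both fields. The function name `summarize_visits` and the sample rows are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def summarize_visits(rows):
    """Aggregate (userid, eventtime, subgroup) tuples into per-visit summaries."""
    visits = defaultdict(list)
    for userid, eventtime, subgroup in rows:
        # Key on both fields: subgroup numbers restart at 0 for every user
        visits[(userid, subgroup)].append(eventtime)
    return {
        key: {"start_time": min(times), "end_time": max(times), "events": len(times)}
        for key, times in visits.items()
    }


# Hypothetical sample: user "a" has two visits, user "b" has one
rows = [
    ("a", 0, 0), ("a", 60, 0), ("a", 500, 1),
    ("b", 10, 0), ("b", 20, 0),
]
for key, summary in sorted(summarize_visits(rows).items()):
    print(key, summary)
```

Keying on subgroup alone would merge the first visit of user "a" with the visit of user "b", since both carry subgroup 0.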