I have data that looks like this:
userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06
rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
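For illustration, a lag call only needs partitioning and ordering; a minimal sketch (column names taken from the sample data above, prev_user is just a hypothetical output column):

from pyspark.sql import Window as W
from pyspark.sql.functions import lag

# lag reads the value exactly `offset` rows back, so no frame clause is required
lag_spec = W.partitionBy("location_point").orderBy("eventtime")
df.withColumn("prev_user", lag("userid", 1).over(lag_spec))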
To get a window over a time series you can use window grouping with standard aggregates:
from pyspark.sql.functions import window, countDistinct

(df
    .groupBy("location_point", window("eventtime", "5 minutes"))
    .agg(countDistinct("userid")))
You can pass more arguments to window to modify the slide duration.
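For example, a sliding variant of the same aggregation (the 1-minute slide is just illustrative):

# third argument is the slide duration, producing overlapping 5-minute windows
(df
    .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
    .agg(countDistinct("userid")))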
You can try something similar with window functions if you partition by location_point:
from pyspark.sql import Window as W
from pyspark.sql.functions import col, countDistinct

days = lambda i: i * 86400  # frame is in seconds, since eventtime is cast to epoch seconds

windowSpec = (W.partitionBy(col("location_point"))
    .orderBy(col("eventtime").cast("timestamp").cast("long"))
    .rangeBetween(0, days(5)))

df.withColumn("id_count", countDistinct("userid").over(windowSpec))
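Be aware that some Spark versions don't allow distinct aggregates as window functions, so countDistinct(...).over(...) may fail with an AnalysisException. A minimal workaround sketch, collecting the distinct userids per frame and counting them (id_count is the same hypothetical output column as above):

from pyspark.sql.functions import size, collect_set

# collect_set keeps only distinct userids within the frame; size counts them
df.withColumn("id_count", size(collect_set("userid").over(windowSpec)))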