How to use lag and rangeBetween functions on timestamp values?

鱼传尺愫 2021-02-06 09:09

I have data that looks like this:

userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06

2 Answers

遥遥无期 2021-02-06 09:32

    rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
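
    For illustration, a minimal sketch of lag over an ordered window (the prev_location name and the partitioning by userid are assumptions based on the sample data):

    from pyspark.sql import Window
    from pyspark.sql.functions import lag

    w = Window.partitionBy("userid").orderBy("eventtime")
    # Each row gets the previous row's location within its partition;
    # no frame specification is needed for lag.
    df.withColumn("prev_location", lag("location_point", 1).over(w))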

    To get a window over a time series you can use window grouping with standard aggregates:

    from pyspark.sql.functions import window, countDistinct

    # Bucket events into five-minute tumbling windows per location
    # and count the distinct users in each bucket.
    (df
        .groupBy("location_point", window("eventtime", "5 minutes"))
        .agg(countDistinct("userid")))
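
    The window expression produces a struct column named window with start and end fields, so the grouped result can be flattened afterwards (a sketch; the id_count alias is an assumption, not from the original):

    result = (df
        .groupBy("location_point", window("eventtime", "5 minutes"))
        .agg(countDistinct("userid").alias("id_count")))
    result.select("location_point", "window.start", "window.end", "id_count")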
    

    window also accepts optional slideDuration and startTime arguments if you want overlapping (sliding) windows.
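
    For example, a five-minute window computed every minute (the one-minute slide is an assumed value, not from the question):

    (df
        .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
        .agg(countDistinct("userid")))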

    You can try something similar with window functions if you partition by location:

    from pyspark.sql import Window as W
    from pyspark.sql.functions import col, collect_set, size

    days = lambda i: i * 86400  # seconds in i days

    windowSpec = (W.partitionBy(col("location_point"))
        .orderBy(col("eventtime").cast("timestamp").cast("long"))
        .rangeBetween(0, days(5)))

    # countDistinct is not supported over a window frame in Spark,
    # so take the size of the set of collected ids instead:
    df.withColumn("id_count", size(collect_set("userid").over(windowSpec)))
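
    Casting eventtime to long orders the rows by epoch seconds, so the rangeBetween boundaries are expressed in seconds; days(5) makes the frame run from the current row to five days ahead of it.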
    
