How to use lag and rangeBetween functions on timestamp values?

鱼传尺愫 2021-02-06 09:09

I have data that looks like this:

userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06         


        
2 Answers
  •  感情败类
    2021-02-06 09:26

    Given your data, let's first add a column holding the timestamp as Unix epoch seconds:

    df = df.withColumn('timestamp', df.eventtime.cast('timestamp').cast('long'))
    df.show()
    
    +--------+-------------------+--------------+----------+
    |  userid|          eventtime|location_point| timestamp|  
    +--------+-------------------+--------------+----------+
    |4e191908|2017-06-04 03:00:00|      18685891|1496545200|
    |4e191908|2017-06-04 03:04:00|      18685891|1496545440|
    |3136afcb|2017-06-04 03:03:00|      18382821|1496545380|
    |661212dd|2017-06-04 03:06:00|      80831484|1496545560|
    |40e8a7c3|2017-06-04 03:12:00|      18825769|1496545920|
    |4e191908|2017-06-04 03:11:30|      18685891|1496545890|
    +--------+-------------------+--------------+----------+  
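
As a quick sanity check on the values above, the same conversion can be reproduced in plain Python. This is a minimal sketch that assumes the Spark session time zone was UTC, which is what the epoch values in the table imply; the helper name `to_epoch_seconds` is hypothetical and not part of Spark.

```python
from datetime import datetime, timezone

def to_epoch_seconds(ts: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC (an assumption) and return Unix seconds."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_epoch_seconds("2017-06-04 03:00:00"))  # 1496545200, matching the first row
```

With a different session time zone, the cast to long would yield values shifted by that zone's UTC offset, though the 5-minute differences between rows would stay the same.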
    

    Now let's define a window partitioned by location_point, ordered by timestamp, and ranging from 300 seconds before the current row's timestamp up to the current row (both bounds inclusive). Counting the rows in this window gives a new column named 'occurrences_in_5_min':

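    from pyspark.sql import Window, functions as F

    # Range frame over the ordering value: rows with timestamp in [current - 300, current]
    w = Window.partitionBy('location_point').orderBy('timestamp').rangeBetween(-60 * 5, 0)
    df = df.withColumn('occurrences_in_5_min', F.count('timestamp').over(w))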
    df.show()
    
    +--------+-------------------+--------------+----------+--------------------+
    |  userid|          eventtime|location_point| timestamp|occurrences_in_5_min|
    +--------+-------------------+--------------+----------+--------------------+
    |40e8a7c3|2017-06-04 03:12:00|      18825769|1496545920|                   1|
    |3136afcb|2017-06-04 03:03:00|      18382821|1496545380|                   1|
    |661212dd|2017-06-04 03:06:00|      80831484|1496545560|                   1|
    |4e191908|2017-06-04 03:00:00|      18685891|1496545200|                   1|
    |4e191908|2017-06-04 03:04:00|      18685891|1496545440|                   2|
    |4e191908|2017-06-04 03:11:30|      18685891|1496545890|                   1|
    +--------+-------------------+--------------+----------+--------------------+
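
To make the window semantics concrete: with an ordering on `timestamp`, `rangeBetween(-300, 0)` covers every row in the same partition whose timestamp lies in the closed interval [t - 300, t], where t is the current row's timestamp. The plain-Python sketch below mimics that count without Spark; the function name `count_in_range_window` and the tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def count_in_range_window(rows, lookback=300):
    """For each (location_point, timestamp) row, count the rows at the same
    location whose timestamp lies in [timestamp - lookback, timestamp]."""
    by_location = defaultdict(list)
    for location, ts in rows:
        by_location[location].append(ts)
    return [
        sum(1 for other in by_location[location] if ts - lookback <= other <= ts)
        for location, ts in rows
    ]

# Same (location_point, timestamp) pairs as the sample data, in input order.
rows = [
    (18685891, 1496545200),
    (18685891, 1496545440),
    (18382821, 1496545380),
    (80831484, 1496545560),
    (18825769, 1496545920),
    (18685891, 1496545890),
]
print(count_in_range_window(rows))  # [1, 2, 1, 1, 1, 1]
```

Only the 03:04:00 event at location 18685891 has an earlier event within 300 seconds, which agrees with the counts in the table. Because both bounds are inclusive, an event exactly 300 seconds earlier would still be counted.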
    

    Now you can add the desired column, which is True when more than one event (the current one included) occurred at that location in the last 5 minutes:

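    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # True when the window count exceeds 1, i.e. another event preceded this one
    add_bool = udf(lambda n: n > 1, BooleanType())
    df = df.withColumn('already_occured', add_bool('occurrences_in_5_min'))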
    df.show()
    
    +--------+-------------------+--------------+----------+--------------------+---------------+
    |  userid|          eventtime|location_point| timestamp|occurrences_in_5_min|already_occured|
    +--------+-------------------+--------------+----------+--------------------+---------------+
    |40e8a7c3|2017-06-04 03:12:00|      18825769|1496545920|                   1|          false|
    |3136afcb|2017-06-04 03:03:00|      18382821|1496545380|                   1|          false|
    |661212dd|2017-06-04 03:06:00|      80831484|1496545560|                   1|          false|
    |4e191908|2017-06-04 03:00:00|      18685891|1496545200|                   1|          false|
    |4e191908|2017-06-04 03:04:00|      18685891|1496545440|                   2|           true|
    |4e191908|2017-06-04 03:11:30|      18685891|1496545890|                   1|          false|
    +--------+-------------------+--------------+----------+--------------------+---------------+
    
