Spark Advanced Window with dynamic last

Asked by 星月不相逢 on 2021-02-09 11:14

Problem: Given time-series clickstream data of user activity stored in Hive, the task is to enrich the data with a session id using Spark.

Session Definition

1. A session expires after 1 hour of inactivity: a click arriving an hour or more after the previous one starts a new session (rule #1).
2. A session stays active for at most 2 hours: a click that would push the accumulated session time to 2 hours or more starts a new session, even without an inactivity gap (rule #2).

4 Answers
  • Answered by 孤独总比滥情好, 2021-02-09 11:50

    Not a straightforward problem to solve, but here's one approach:

    1. Use a Window lag timestamp difference to identify session starts (0 = start of a session) per user for rule #1
    2. Group the dataset to assemble the timestamp-diff list per user
    3. Process the timestamp-diff list via a UDF to identify sessions for rule #2 and create all session ids per user
    4. Expand the grouped dataset via Spark's explode

    Sample code below:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    import spark.implicits._  // assumes an active SparkSession named `spark` (e.g. spark-shell)
    
    val userActivity = Seq(
      ("2018-01-01 11:00:00", "u1"),
      ("2018-01-01 12:10:00", "u1"),
      ("2018-01-01 13:00:00", "u1"),
      ("2018-01-01 13:50:00", "u1"),
      ("2018-01-01 14:40:00", "u1"),
      ("2018-01-01 15:30:00", "u1"),
      ("2018-01-01 16:20:00", "u1"),
      ("2018-01-01 16:50:00", "u1"),
      ("2018-01-01 11:00:00", "u2"),
      ("2018-01-02 11:00:00", "u2")
    ).toDF("click_time", "user_id")
    
    // Assign session ids from the per-user timestamp-diff list: a diff of 0
    // (rule #1) or an accumulated duration reaching tmo (rule #2) opens a new session
    def clickSessList(tmo: Long) = udf{ (uid: String, clickList: Seq[String], tsList: Seq[Long]) =>
      def sid(n: Long) = s"$uid-$n"
    
      val sessList = tsList.foldLeft( (List[String](), 0L, 0L) ){ case ((ls, j, k), i) =>
        if (i == 0 || j + i >= tmo)
          (sid(k + 1) :: ls, 0L, k + 1)  // open session k+1, reset the accumulated time
        else
          (sid(k) :: ls, j + i, k)       // stay in session k, accumulate the time diff
      }._1.reverse
    
      clickList zip sessList
    }
    

    Note that the accumulator for foldLeft in the UDF is a Tuple of (ls, j, k), where:

    • ls is the list of formatted session ids to be returned
    • j is the time accumulated since the current session started and k is the current session id number; both are carried over (and conditionally reset) on each iteration, as traced in the sketch below
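
    To make the fold concrete, here is a plain-Scala trace (a sketch of my own, not part of the original answer) over u1's ts_diff values from step 1 below, with tmo = 7200, i.e. rule #2's 2-hour cap in seconds:

    // same logic as the UDF body, hard-coded for user u1
    val tsList = List(0L, 0L, 3000L, 3000L, 3000L, 3000L, 3000L, 1800L)
    val tmo = 7200L
    
    val sessList = tsList.foldLeft( (List[String](), 0L, 0L) ){ case ((ls, j, k), i) =>
      if (i == 0 || j + i >= tmo) (s"u1-${k + 1}" :: ls, 0L, k + 1)  // new session
      else (s"u1-$k" :: ls, j + i, k)                                // same session
    }._1.reverse
    // sessList: List(u1-1, u1-2, u1-2, u1-2, u1-3, u1-3, u1-3, u1-4)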

    Step 1:

    val tmo1: Long = 60 * 60       // rule #1: 1 hour of inactivity, in seconds
    val tmo2: Long = 2 * 60 * 60   // rule #2: 2-hour session cap, in seconds
    
    val win1 = Window.partitionBy("user_id").orderBy("click_time")
    
    val df1 = userActivity.
      // seconds elapsed since the user's previous click (null for the first click)
      withColumn("ts_diff", unix_timestamp($"click_time") - unix_timestamp(
        lag($"click_time", 1).over(win1))
      ).
      // rule #1: zero out the diff at session starts (first click, or a gap >= 1 hour)
      withColumn("ts_diff", when(row_number.over(win1) === 1 || $"ts_diff" >= tmo1, 0L).
        otherwise($"ts_diff")
      )
    
    df1.show
    // +-------------------+-------+-------+
    // |         click_time|user_id|ts_diff|
    // +-------------------+-------+-------+
    // |2018-01-01 11:00:00|     u1|      0|
    // |2018-01-01 12:10:00|     u1|      0|
    // |2018-01-01 13:00:00|     u1|   3000|
    // |2018-01-01 13:50:00|     u1|   3000|
    // |2018-01-01 14:40:00|     u1|   3000|
    // |2018-01-01 15:30:00|     u1|   3000|
    // |2018-01-01 16:20:00|     u1|   3000|
    // |2018-01-01 16:50:00|     u1|   1800|
    // |2018-01-01 11:00:00|     u2|      0|
    // |2018-01-02 11:00:00|     u2|      0|
    // +-------------------+-------+-------+
    
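    An aside (my own sketch, not part of the original answer): if only rule #1 applied, no UDF would be needed, since the session id is then just a running sum of session-start flags over the same window. It is rule #2's rolling 2-hour cap, which depends on where the previous session began, that plain window functions cannot express:

    // rule #1 only: session id = cumulative count of session starts per user
    val df1a = userActivity.
      withColumn("ts_diff", unix_timestamp($"click_time") -
        unix_timestamp(lag($"click_time", 1).over(win1))).
      withColumn("is_start", when($"ts_diff".isNull || $"ts_diff" >= tmo1, 1).otherwise(0)).
      withColumn("sess_id", concat($"user_id", lit("-"), sum($"is_start").over(win1).cast("string")))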

    Steps 2-4:

    val df2 = df1.
      groupBy("user_id").agg(
        collect_list($"click_time").as("click_list"), collect_list($"ts_diff").as("ts_list")
      ).
      // the UDF returns (click_time, sess_id) pairs; explode restores one row per click
      withColumn("click_sess_id",
        explode(clickSessList(tmo2)($"user_id", $"click_list", $"ts_list"))
      ).
      select($"user_id", $"click_sess_id._1".as("click_time"), $"click_sess_id._2".as("sess_id"))
    
    df2.show
    // +-------+-------------------+-------+
    // |user_id|click_time         |sess_id|
    // +-------+-------------------+-------+
    // |u1     |2018-01-01 11:00:00|u1-1   |
    // |u1     |2018-01-01 12:10:00|u1-2   |
    // |u1     |2018-01-01 13:00:00|u1-2   |
    // |u1     |2018-01-01 13:50:00|u1-2   |
    // |u1     |2018-01-01 14:40:00|u1-3   |
    // |u1     |2018-01-01 15:30:00|u1-3   |
    // |u1     |2018-01-01 16:20:00|u1-3   |
    // |u1     |2018-01-01 16:50:00|u1-4   |
    // |u2     |2018-01-01 11:00:00|u2-1   |
    // |u2     |2018-01-02 11:00:00|u2-2   |
    // +-------+-------------------+-------+
    

    Also note that click_time is "passed through" in steps 2-4 so as to be included in the final dataset; a defensive variant of the same steps is sketched below.
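
    One more hedged note of my own: Spark does not guarantee that collect_list preserves row order across a shuffle. The window in step 1 typically leaves each user's rows in click_time order, but a defensive variant of steps 2-4 collects (click_time, ts_diff) structs and sorts them before calling the UDF:

    // sort_array orders the structs by click_time (the first struct field), so the
    // UDF sees the diffs in time order regardless of collect_list's internal ordering
    val df2a = df1.
      groupBy("user_id").agg(
        sort_array(collect_list(struct($"click_time", $"ts_diff"))).as("clicks")
      ).
      withColumn("click_sess_id",
        explode(clickSessList(tmo2)($"user_id", $"clicks.click_time", $"clicks.ts_diff"))
      ).
      select($"user_id", $"click_sess_id._1".as("click_time"), $"click_sess_id._2".as("sess_id"))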
