How to extract timed-out sessions using mapWithState

Question

I am updating my code to switch from updateStateByKey to mapWithState in order to build users' sessions based on a time-out of 2 minutes (the value 2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) received for that session before it times out.

This was my old code:

val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
  val parsed = Utils.parseJSON(eventRecord)
  val member_id = parsed.getOrElse("member_id", "")
  val timestamp = parsed.getOrElse("timestamp", "").toLong
  //The timestamp is returned twice because the first one will be used as the start time and the second one as the end time
  (member_id, (timestamp, timestamp, List(eventRecord)))
})

val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  //transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
}).
  reduceByKey((a, b) => {
    //transform to (member_id, (lowestStartTime, MaxFinishTime, sumOfCounter, events within session))
    (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
  }).updateStateByKey(Utils.updateState)

The problems with updateStateByKey are nicely explained here. One of the key reasons I decided to use mapWithState is that updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.

This is my first attempt to transform the old code to the new version:

val spec = StateSpec.function(updateState _).timeout(Minutes(2))
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  //transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
})
val userSessionSnapshots = latestSessionInfo.mapWithState(spec).snapshotStream()

I do not quite understand what the content of updateState should be, because as far as I understand the time-out should no longer be calculated manually (that was previously done in my function Utils.updateState) and .snapshotStream should return the timed-out sessions.

Answer 1:

Assuming you always wait on a timeout of 2 minutes, you can make your mapWithState stream emit data only once the timeout is triggered.

What would this mean for your code? It would mean that you now need to check for the timeout instead of outputting the tuple in each iteration. I would imagine the function you pass to mapWithState would look something along the lines of:

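import org.apache.spark.streaming.State

// Session tuple: (startTime, endTime, eventCount, events)
def updateState(key: String,
                value: Option[(Long, Long, Long, List[String])],
                state: State[(Long, Long, Long, List[String])]): Option[(Long, Long, Long, List[String])] = {
  // Merge two partial sessions: earliest start, latest end, summed count, concatenated events
  def reduce(first: (Long, Long, Long, List[String]), second: (Long, Long, Long, List[String])) = {
    (Math.min(first._1, second._1), Math.max(first._2, second._2), first._3 + second._3, first._4 ++ second._4)
  }

  value match {
    case Some(currentValue) =>
      // New data for this key: fold it into the stored session and emit nothing yet
      val result = state
        .getOption()
        .map(currentState => reduce(currentState, currentValue))
        .getOrElse(currentValue)
      state.update(result)
      None
    // No new data and the key has been idle past the timeout: emit the finished session
    case _ if state.isTimingOut() => state.getOption()
    // Defensive fallback so the match is exhaustive
    case _ => None
  }
}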

This way, you only output something externally to the stream if the state has timed out, otherwise you aggregate it inside the state.
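
To see that behaviour without a running stream, here is a minimal sketch that exercises the same decision logic in plain Scala. FakeState is a hypothetical stand-in that only mimics the three State calls used above (getOption, update, isTimingOut); it is not part of Spark, and the names SessionStateSketch, merge and step are likewise made up for illustration.

```scala
object SessionStateSketch {
  // (startTime, endTime, eventCount, events), the same tuple shape used above
  type Session = (Long, Long, Long, List[String])

  // Hypothetical stand-in for Spark's State[S]; it only mimics the calls
  // the mapping function relies on and is NOT a Spark class.
  final class FakeState[S](private var value: Option[S] = None,
                           val timingOut: Boolean = false) {
    def getOption(): Option[S] = value
    def update(newValue: S): Unit = { value = Some(newValue) }
    def isTimingOut(): Boolean = timingOut
  }

  // Earliest start, latest end, summed count, concatenated events
  def merge(a: Session, b: Session): Session =
    (math.min(a._1, b._1), math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)

  // Same decision logic as updateState, written against FakeState
  def step(value: Option[Session], state: FakeState[Session]): Option[Session] =
    value match {
      case Some(current) =>
        state.update(state.getOption().map(merge(_, current)).getOrElse(current))
        None // session still open: emit nothing
      case None if state.isTimingOut() => state.getOption() // emit finished session
      case None => None
    }

  def main(args: Array[String]): Unit = {
    val open = new FakeState[Session]()
    println(step(Some((100L, 100L, 1L, List("e1"))), open)) // event arrives
    println(step(Some((160L, 160L, 1L, List("e2"))), open)) // another event
    // Simulate the timeout pass: no new value, state flagged as timing out
    val expired = new FakeState[Session](open.getOption(), timingOut = true)
    println(step(None, expired))
  }
}
```

The two event calls return nothing and only grow the stored session; the aggregated tuple comes back only on the simulated timeout pass, which is the single point where the real mapping function emits a value.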

This means that your Spark DStream graph can filter out all values which aren't defined, and only keep those which are:

latestSessionInfo
 .mapWithState(spec)
 .filter(_.isDefined)

After the filter, you'll only have the states which have timed out, each still wrapped in an Option.
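
If the downstream code wants the bare session tuples, the filter can be followed by an unwrapping step. The sketch below shows the idea on a plain Scala Seq standing in for one batch of mapWithState output; the object name UnwrapSketch and the sample values are assumptions made for illustration.

```scala
object UnwrapSketch {
  type Session = (Long, Long, Long, List[String])

  // Stand-in for one batch of mapWithState output: None for keys that are
  // still aggregating, Some(session) for keys whose state timed out.
  val batch: Seq[Option[Session]] = Seq(
    None,
    Some((100L, 160L, 2L, List("e1", "e2"))),
    None
  )

  // filter(_.isDefined) keeps the Option wrapper; map(_.get) removes it.
  // The call to get is safe here only because the filter ran first.
  val finished: Seq[Session] = batch.filter(_.isDefined).map(_.get)

  def main(args: Array[String]): Unit = finished.foreach(println)
}
```

On the DStream itself the same pair of calls, .filter(_.isDefined).map(_.get), should apply, since DStream provides both filter and map.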

Source: https://stackoverflow.com/questions/40786904/how-to-extract-timed-out-sessions-using-mapwithstate

Tags

scala

apache-spark

spark-streaming