Question
I have implemented a daily batch computation. Here is some pseudo-code. A "newUser" can also be called a first-activated user.
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(newHistoryUser, path + todayDate)
The computation of "activeUser" can be converted to Spark Streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  val time = System.currentTimeMillis()
  // Start of the current day in UTC+8, as a Unix timestamp in seconds
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
// For each key, keep the record with the smallest ctime seen in the 24-hour window
val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
  var firstLine = x.head
  x.foreach(line => {
    if (line._2 < firstLine._2) firstLine = line
  })
  firstLine
})
But the approach for "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is a subset of "activeUser", and I must maintain a set of "historyUser" to know which part is "newUser".
I have considered an approach, but I think it may not work the right way:
Load the history users as an RDD. For each DStream batch of "activeUser", find the elements that do not exist in "historyUser". One problem here is deciding when to update this "historyUser" RDD so that I get the correct "newUser" for a window.
Updating the "historyUser" RDD means adding the "newUser" to it, just like what I did in the pseudo-code above, where "historyUser" is updated once a day. Another problem is how to perform this RDD update from a DStream. I think updating "historyUser" when the window slides would be appropriate, but I haven't found a proper API for it. A rough sketch of the idea follows.
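For illustration only, here is how I imagine it could look, building on the "activeUser" DStream above and the loadFromHdfs/path helpers from my pseudo-code; keeping the history in a driver-side var and rewriting it from foreachRDD is exactly the part I am not sure is correct:
// History users keyed the same way as activeUser: ((appid, bcode, uid), value)
var historyUser = loadFromHdfs(path + yesterdayDate).cache()
// Per batch, keep only the keys of the current window that are not yet in the history set
val newUser = activeUser.transform(batchRdd => batchRdd.subtractByKey(historyUser))
newUser.foreachRDD { newRdd =>
  // Grow the history set with the users first seen in this window.
  // The lineage of historyUser keeps growing here, which is one reason
  // I doubt this approach works well without extra checkpointing.
  historyUser = historyUser.union(newRdd).cache()
}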
So what is the best practice for solving this problem?
Answer 1:
updateStateByKey would help here, as it allows you to set an initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
// Load the already known historical users (UserData(...) is left elided as in the original)
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))

case class UserStatusState(isNew: Boolean, values: UserData)

// This prepares the RDD of already known historical users
// to pass into updateStateByKey as the initial state.
// It must be keyed the same way as sdkLogDs; user.uid is assumed to be that key here.
val initialStateRDD = historyUsers.map(user => (user.uid, UserStatusState(isNew = false, user)))

// Stateful stream (sdkLogDs is assumed to be a pair DStream keyed by user id;
// HashPartitioner is org.apache.spark.HashPartitioner)
val trackUsers = sdkLogDs.updateStateByKey(updateState,
  new HashPartitioner(sdkLogDs.context.sparkContext.defaultParallelism),
  true, initialStateRDD)

// Only new users
val newUsersStream = trackUsers.filter(_._2.isNew)

def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
  // Group all values for a specific user as needed
  // (newValues can be empty for keys that have state but no new data in this batch)
  val groupedUserData: UserData = newValues.reduce(...)
  // prevState is defined only for users previously seen in the stream
  // or loaded as initial state from the historyUsers RDD;
  // for new users it is None
  val isNewUser = prevState.isEmpty
  // As state is returned here for the user, prevState won't be None on the next iterations
  Some(UserStatusState(isNewUser, groupedUserData))
}
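A couple of practical points when wiring this up: updateStateByKey requires a checkpoint directory to be set on the StreamingContext, and newUsersStream can then be consumed like any other DStream. A minimal sketch, assuming ssc is your StreamingContext, a placeholder checkpoint path, and reusing the saveToHdfs helper from the question:
// Checkpointing is mandatory for stateful transformations such as updateStateByKey
ssc.checkpoint("/tmp/checkpoint")
// Count how many new users appear in each batch
newUsersStream.count().print()
// Optionally persist the keys of the new users of each batch, e.g. back to HDFS
newUsersStream.foreachRDD { (rdd, batchTime) =>
  saveToHdfs(rdd.keys, path + batchTime.milliseconds)
}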
Source: https://stackoverflow.com/questions/34786117/how-to-count-new-element-from-stream-by-using-spark-streaming