Batch lookup data for Spark streaming


Question


I need to look up some data from a file on HDFS in a Spark Streaming job. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?

  • how can I reload the data in memory (a hashmap) immediately after a
    daily update?
  • how can I serve the streaming job continuously while this lookup data is being fetched?

Answer 1:


One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

val mainStream: DStream[T] = ???

Next, you can create another stream which reads the lookup data:

val lookupStream: DStream[(K, V)] = ???
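How you build this stream depends on how the batch job publishes its output. One possibility, sketched below under the assumption that the daily job drops tab-separated key/value text files into a fixed HDFS directory (the path and the line format are placeholders, and ssc is your StreamingContext), is to watch that directory with textFileStream:

// Sketch: watch the HDFS drop directory for new files written by the daily batch job.
// Assumes "key<TAB>value" lines; adapt the parsing to your actual format.
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/")   // placeholder path
    .map(_.split("\t", 2))
    .filter(_.length == 2)
    .map(a => (a(0), a(1)))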

and a simple function which can be used to update the state:

def update(
  current: Seq[V],  // A sequence of values for a given key in the current batch
  prev: Option[V]   // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption    // If the current batch is not empty, take the first element
    .orElse(prev)  // If it is empty (None), keep the previous state
}

These two pieces can be used to create the state:

val state = lookupStream.updateStateByKey(update)
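Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext, for example (the path here is just a placeholder; any reliable HDFS directory works):

ssc.checkpoint("hdfs:///checkpoints/lookup-job")  // placeholder checkpoint directory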

All that's left is to key mainStream and join it with the state:

def toPair(t: T): (K, T) = ???  // extract the lookup key for a record

mainStream.map(toPair).leftOuterJoin(state)

While this is probably less than optimal from a performance point of view, it leverages the architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.
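Putting it all together, a minimal end-to-end sketch might look like this. The concrete choices here are assumptions for illustration only: String keys and values, a socket source standing in for the real main stream, tab-separated lookup files, and placeholder HDFS paths.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("stream-with-daily-lookup")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/lookup-job")  // required by updateStateByKey; placeholder path

// Main stream: a socket source standing in for the real input (e.g. Kafka).
val mainStream: DStream[String] = ssc.socketTextStream("localhost", 9999)

// Lookup stream: new files dropped daily into HDFS by the batch job.
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/")      // placeholder drop directory
    .map(_.split("\t", 2))
    .filter(_.length == 2)
    .map(a => (a(0), a(1)))

// Keep the latest value seen for each key.
def update(current: Seq[String], prev: Option[String]): Option[String] =
  current.headOption.orElse(prev)

val state = lookupStream.updateStateByKey(update)

// Key the main stream by its lookup key and enrich it with the latest lookup value.
val enriched = mainStream
  .map(record => (record.split("\t", 2)(0), record))
  .leftOuterJoin(state)

enriched.print()
ssc.start()
ssc.awaitTermination()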



Source: https://stackoverflow.com/questions/37447393/batch-lookup-data-for-spark-streaming
