Batch lookup data for Spark streaming


Question


I need to look up some data from a file on HDFS in a Spark Streaming job. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?

  • how can I reload the data in memory (a hashmap) immediately after a
    daily update?
  • how can I serve the streaming job continuously while this lookup data is being fetched?

Answer 1:


One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

val mainStream: DStream[T] = ???

Next, you can create another stream which reads the lookup data:

val lookupStream: DStream[(K, V)] = ???
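How you build this stream depends on how the batch job publishes its output. One possibility, sketched below under the assumption that the daily job drops tab-separated key/value text files into a fixed HDFS directory (the path and the line format are placeholders, and ssc is your StreamingContext), is to watch that directory with textFileStream:

// Sketch: watch the HDFS drop directory for new files written by the daily batch job.
// Assumes "key<TAB>value" lines; adapt the parsing to your actual format.
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/")   // placeholder path
    .map(_.split("\t", 2))
    .filter(_.length == 2)
    .map(a => (a(0), a(1)))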

and a simple function which can be used to update the state:

def update(
  current: Seq[V],  // A sequence of values for a given key in the current batch
  prev: Option[V]   // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption    // If the current batch is not empty, take the first element
    .orElse(prev)  // If it is empty (None), keep the previous state
}

These two pieces can be used to create the state:

val state = lookupStream.updateStateByKey(update)
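Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext, for example (the path here is just a placeholder; any reliable HDFS directory works):

ssc.checkpoint("hdfs:///checkpoints/lookup-job")  // placeholder checkpoint directory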

All that's left is to key mainStream and join it with the state:

def toPair(t: T): (K, T) = ???  // extract the lookup key for a record

mainStream.map(toPair).leftOuterJoin(state)

While this is probably less than optimal from a performance point of view, it leverages the architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.
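Putting it all together, a minimal end-to-end sketch might look like this. The concrete choices here are assumptions for illustration only: String keys and values, a socket source standing in for the real main stream, tab-separated lookup files, and placeholder HDFS paths.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("stream-with-daily-lookup")
val ssc  = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/lookup-job")  // required by updateStateByKey; placeholder path

// Main stream: a socket source standing in for the real input (e.g. Kafka).
val mainStream: DStream[String] = ssc.socketTextStream("localhost", 9999)

// Lookup stream: new files dropped daily into HDFS by the batch job.
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/")      // placeholder drop directory
    .map(_.split("\t", 2))
    .filter(_.length == 2)
    .map(a => (a(0), a(1)))

// Keep the latest value seen for each key.
def update(current: Seq[String], prev: Option[String]): Option[String] =
  current.headOption.orElse(prev)

val state = lookupStream.updateStateByKey(update)

// Key the main stream by its lookup key and enrich it with the latest lookup value.
val enriched = mainStream
  .map(record => (record.split("\t", 2)(0), record))
  .leftOuterJoin(state)

enriched.print()
ssc.start()
ssc.awaitTermination()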



Source: https://stackoverflow.com/questions/37447393/batch-lookup-data-for-spark-streaming
