I need to look up some data in a Spark-streaming job from a file on HDFS
This data is fetched once a day by a batch job.
Is there a \"design pattern\" for such a t
One possible approach is to drop local data structures and use stateful stream instead. Lets assume you have main data stream called mainStream
:
val mainStream: DStream[T] = ???
Next you can create another stream which reads lookup data:
val lookupStream: DStream[(K, V)] = ???
and a simple function which can be used to update state
def update(
current: Seq[V], // A sequence of values for a given key in the current batch
prev: Option[V] // Value for a given key from in the previous state
): Option[V] = {
current
.headOption // If current batch is not empty take first element
.orElse(prev) // If it is empty (None) take previous state
}
This two pieces can be used to create state:
val state = lookup.updateStateByKey(update)
All whats left is to key-by mainStream
and connect data:
def toPair(t: T): (K, T) = ???
mainStream.map(toPair).leftOuterJoin(state)
While this is probably less than optimal from a performance point of view it leverages architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.