Batch lookup data for Spark streaming

I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a t

1 Answer

时光取名叫无心

2021-01-25 04:00
One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:
```
val mainStream: DStream[T] = ???
```
Next, you can create another stream that reads the lookup data:
```
val lookupStream: DStream[(K, V)] = ???
```
and a simple function that can be used to update the state:
```
def update(
  current: Seq[V],  // Values that arrived for a given key in the current batch
  prev: Option[V]   // Value for that key in the previous state, if any
): Option[V] = {
  current
    .headOption    // If the current batch is not empty, take its first element
    .orElse(prev)  // If it is empty (None), keep the previous state
}
```
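To make the behaviour of this function concrete, here is a minimal plain-Scala sketch that calls it for three consecutive batches of a single key. It does not use Spark at all; fixing `V` to `String` and the object name `UpdateDemo` are assumptions made purely for illustration.

```scala
// Plain-Scala check of the update semantics; V is fixed to String for illustration.
object UpdateDemo {
  def update(current: Seq[String], prev: Option[String]): Option[String] =
    current.headOption.orElse(prev)

  // No previous state and a new value arrives: the new value becomes the state.
  val first: Option[String] = update(Seq("v1"), None)

  // Nothing arrives for the key in this batch: the previous state is carried over.
  val carried: Option[String] = update(Seq.empty, first)

  // Several values arrive in one batch: the first one replaces the old state.
  val replaced: Option[String] = update(Seq("v2", "v3"), carried)
}
```

If a single batch can contain several values for the same key, taking the first one may not be what you want; `lastOption` or an explicit comparison on a timestamp field could be a better fit, depending on how the lookup file is written.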
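These two pieces can be used to create the state (note that updateStateByKey requires a checkpoint directory to be set on the StreamingContext):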
```
val state = lookupStream.updateStateByKey(update)
```
All that's left is to key mainStream and join it with the state:
```
def toPair(t: T): (K, T) = ???

mainStream.map(toPair).leftOuterJoin(state)
```
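The shape of the joined records may not be obvious: each element is a pair of the key and `(T, Option[V])`, where the lookup side is `None` for keys that are missing from the state. The sketch below models one micro-batch with plain Scala collections rather than Spark. The concrete types (`K = Int`, `T = V = String`), the sample data, and the length-based `toPair` are all hypothetical stand-ins.

```scala
// Plain-Scala model of one micro-batch of the join; all types and data are assumptions.
object JoinDemo {
  // Snapshot of the state, as it might look after the lookup data was read.
  val state: Map[Int, String] = Map(1 -> "alice", 2 -> "bob")

  // Hypothetical keying function standing in for toPair.
  def toPair(t: String): (Int, String) = (t.length, t)

  // One batch of records from the main stream.
  val batch: Seq[String] = Seq("a", "bb", "ccc")

  // Equivalent of leftOuterJoin for this batch: every main record is kept,
  // and the lookup value is None when the key is absent from the state.
  val joined: Seq[(Int, (String, Option[String]))] =
    batch.map(toPair).map { case (k, t) => (k, (t, state.get(k))) }
}
```

Because the state holds at most one value per key, a map lookup gives the same result as a left outer join here. Downstream code then has to decide how to handle the `None` case, for example by dropping the record or falling back to a default.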
While this is probably less than optimal from a performance point of view, it leverages architecture that is already in place and frees you from manually dealing with invalidation or failure recovery.