Batch lookup data for Spark streaming

余生分开走 2021-01-25 03:35

I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a t

1 answer
  •  时光取名叫无心
    2021-01-25 04:00

    One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

    val mainStream: DStream[T] = ???
    

    Next you can create another stream which reads lookup data:

    val lookupStream: DStream[(K, V)] = ???
    

    and a simple function which can be used to update the state:

    def update(
      current: Seq[V],  // Values that arrived for a given key in the current batch
      prev: Option[V]   // Value stored for that key in the previous state, if any
    ): Option[V] = {
      current
        .headOption    // If the current batch is not empty, take its first element
        .orElse(prev)  // If it is empty (None), keep the previous state
    }
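
To make the carry-forward behaviour of update concrete, here is a minimal plain-Scala sketch that needs no Spark at all. It assumes V is String purely for illustration; the object name UpdateDemo and the sample values are made up for this example.

```scala
object UpdateDemo {
  // Same logic as the update function above, with V fixed to String (an assumption).
  def update(current: Seq[String], prev: Option[String]): Option[String] =
    current.headOption.orElse(prev)

  // A new value arrived in this batch: it replaces the stored one.
  val refreshed: Option[String] = update(Seq("new"), Some("old"))

  // Nothing arrived in this batch: the previous value is carried forward.
  val carried: Option[String] = update(Seq.empty, Some("old"))

  // Nothing arrived and nothing was stored: the key stays absent.
  val unseen: Option[String] = update(Seq.empty, None)

  // Several values arrived in one batch: headOption keeps the first one it sees.
  val firstWins: Option[String] = update(Seq("a", "b"), Some("old"))
}
```

If a key can appear more than once per batch, which value counts as "first" depends on how the batch is ordered, so it may be worth deduplicating the lookup data upstream.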
    

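    These two pieces can be used to create the state:

    // Note: updateStateByKey requires checkpointing to be enabled on the StreamingContext
    val state = lookupStream.updateStateByKey(update _)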
    

    All that's left is to key mainStream and join the data:

    def toPair(t: T): (K, T) = ???
    
    mainStream.map(toPair).leftOuterJoin(state)
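
The shape of the joined records can be sketched with plain Scala collections. This is only an illustration of what one batch looks like after a left outer join, not Spark code: the Event type, the sample data, and the Map standing in for the state are all hypothetical.

```scala
object JoinDemo {
  // Hypothetical record type for the main stream; the key is the user id.
  case class Event(userId: Int, action: String)

  def toPair(e: Event): (Int, Event) = (e.userId, e)

  // Stand-in for the state stream: at most one lookup value per key.
  val state: Map[Int, String] = Map(1 -> "DE", 2 -> "FR")

  // Stand-in for one batch of the main stream.
  val batch: Seq[Event] = Seq(Event(1, "click"), Event(3, "view"))

  // Same result shape as leftOuterJoin on pair streams: (K, (T, Option[V])).
  // Every main record is kept; a missing lookup value shows up as None.
  val joined: Seq[(Int, (Event, Option[String]))] =
    batch.map(toPair).map { case (k, e) => (k, (e, state.get(k))) }
}
```

Here the event for user 3 survives the join with None as its lookup value, so downstream code has to decide how to treat records whose key is not yet in the lookup data.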
    

    While this is probably less than optimal from a performance point of view, it leverages architecture that is already in place and frees you from manually dealing with invalidation or failure recovery.
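
Putting the pieces together, an end-to-end job might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the "key,value" line format, the socket source for the main stream, the HDFS paths, the batch interval, and the use of textFileStream to pick up the daily lookup file. The answer itself leaves both stream sources unspecified.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object LookupJoinSketch {
  // Assumed line format "key,value"; a line without a comma throws a MatchError.
  def parse(line: String): (String, String) = {
    val Array(k, v) = line.split(",", 2)
    (k, v)
  }

  // Keep the newest lookup value for a key, else carry the previous one forward.
  def update(current: Seq[String], prev: Option[String]): Option[String] =
    current.headOption.orElse(prev)

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("lookup-join-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // updateStateByKey needs a checkpoint directory (placeholder path).
    ssc.checkpoint("hdfs:///tmp/lookup-join-checkpoint")

    // Assumed source for the main stream.
    val mainStream: DStream[String] = ssc.socketTextStream("localhost", 9999)

    // Assumed: the daily batch job writes each new lookup file into this directory.
    val lookupStream: DStream[(String, String)] =
      ssc.textFileStream("hdfs:///data/lookup").map(parse)

    val state: DStream[(String, String)] =
      lookupStream.updateStateByKey[String](update _)

    // Each output record has the form (key, (mainValue, Option[lookupValue])).
    mainStream.map(parse).leftOuterJoin(state).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

textFileStream only notices files that appear in the monitored directory after the job starts, so the batch job would need to write each day's file there as a new file, ideally by moving it into place once it is complete.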
