Spark Streaming updateStateByKey for array aggregation

Backend · open · 1 answer · 1294 views
盖世英雄少女心 2021-01-21 04:01

I have input lines like below

t1, file1, 1, 1, 1

t1, file1, 1, 2, 3

t1, file2, 2, 2, 2, 2

t2, file1, 5, 5, 5

t2, file2, 1, 1, 2, 2

1 Answer
  • 2021-01-21 04:19

    What you're looking for is updateStateByKey. For a DStream[(T, U)] it takes a function with two arguments:

    • Seq[U] – the new values seen for the key in the current batch
    • Option[U] – the accumulated state for that key, if any

    and returns Option[U] (returning None drops the key from the state).

    Given your code it could be implemented for example like this:

    import breeze.linalg.{DenseVector => BDV}
    import scala.util.Try
    
    val state: DStream[(String, Array[Int])] = parsedStream.updateStateByKey(
      (current: Seq[Array[Int]], prev: Option[Array[Int]]) => {
        // Prepend the previous state (if any) to the arrays from this batch,
        // then sum them element-wise via breeze vectors. Try guards against
        // arrays of mismatched length; such keys yield None and are dropped.
        prev.map(_ +: current).orElse(Some(current))
          .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
      })
    

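    The snippet above assumes a parsedStream of type DStream[(String, Array[Int])]. One way it might be built from lines like those in the question, keying on the first two fields, could look like this (the name `lines` and the input source are assumptions, not from the original post):

    import org.apache.spark.streaming.dstream.DStream

    // lines: DStream[String], e.g. from socketTextStream; name assumed
    val parsedStream: DStream[(String, Array[Int])] = lines.map { line =>
      val fields = line.split(",").map(_.trim)
      // key on the first two fields, e.g. "t1,file1";
      // the remaining columns become the Array[Int] value
      (fields.take(2).mkString(","), fields.drop(2).map(_.toInt))
    }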
    To be able to use it you'll have to configure checkpointing.
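    Checkpointing is enabled on the StreamingContext before it is started; the directory below is a placeholder (any HDFS-compatible or local path works):

    // ssc: the application's StreamingContext; path is a placeholder
    ssc.checkpoint("hdfs:///tmp/spark-checkpoints")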
