Accessing Collection of DStreams


I am trying to access a collection of filtered DStreams obtained as in the solution to this question: Spark Streaming - Best way to Split Input Stream based on filter Param

1 Answer

    As the exception indicates, the DStream definition is being captured by the closure. A simple option is to declare the DStream @transient:

    @transient val spamTagStream = //KafkaUtils.create...
    

    @transient flags certain fields to be excluded when the object graph containing them goes through Java serialization. The key to this scenario is that some val declared in the same scope as the DStream (statusCodeStreams in this case) is used within the closure. The actual reference to that val from within the closure is outer.statusCodeStreams, which causes the serialization process to "pull" the whole context of outer into the closure. With @transient we mark the DStream (and also the StreamingContext) declarations as non-serializable and we avoid the serialization issue. Depending on the code structure (e.g. if it is all linear in one main function, which is bad practice, by the way), it might be necessary to mark ALL DStream declarations plus the StreamingContext instance as @transient.
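
    A minimal sketch of that layout with the @transient markings in place. This is illustrative only: the socket source stands in for the elided KafkaUtils call, and names such as StreamingJob and the sample status codes are assumptions, not the original code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    class StreamingJob {
        // These are fields of the enclosing instance: closures defined below
        // capture `this`, so any field not marked @transient would be dragged
        // into the serialized closure along with it.
        @transient private val ssc = new StreamingContext(
            new SparkConf().setAppName("split").setMaster("local[2]"), Seconds(10))
        @transient private val spamTagStream: DStream[String] =
            ssc.socketTextStream("localhost", 9999) // stand-in for KafkaUtils.create...

        def run(statusCodes: Seq[String]): Unit = {
            // One filtered DStream per status code, as in the linked question
            val statusCodeStreams = statusCodes.map(code => spamTagStream.filter(_.contains(code)))
            statusCodeStreams.foreach(_.print())
            ssc.start()
            ssc.awaitTermination()
        }
    }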

    If the only intent of the initial filtering is to 'route' the content to separate Kafka topics, it might be worth moving the filtering inside the foreachRDD. That would make for a simpler program structure.

    spamTagStream.foreachRDD { rdd =>
        // Cache once so each per-code filter pass reuses the same data
        rdd.cache()
        statusCodes.foreach { code =>
            val matchingCodes = rdd.filter(_.contains(code)) // illustrative predicate
            matchingCodes.foreachPartition { partition =>
                // write the partition's records to the Kafka topic for `code`
            }
        }
        rdd.unpersist(true)
    }
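
    One way to fill in the "write to kafka" step, sketched as a hypothetical helper with a per-partition producer. The broker address and topic naming scheme are assumptions; building the producer inside foreachPartition keeps it from having to be serialized on the driver:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.rdd.RDD

    def writeToKafka(matchingCodes: RDD[String], code: String): Unit =
        matchingCodes.foreachPartition { partition =>
            // One producer per partition, created on the executor
            val props = new Properties()
            props.put("bootstrap.servers", "localhost:9092") // assumed broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            val producer = new KafkaProducer[String, String](props)
            partition.foreach { record =>
                producer.send(new ProducerRecord[String, String]("status-" + code, record))
            }
            producer.close() // flushes any buffered records
        }

    Per-record send is acceptable here because the producer batches internally, and close() flushes whatever is still buffered.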
    