How to use SQLContext and SparkContext inside foreachPartition

逝去的感伤 2020-12-09 07:09

I want to use SparkContext and SQLContext inside foreachPartition, but I am unable to do so due to a serialization error. I know that both objects are not serializable.
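
For reference, a minimal sketch of the kind of job that triggers this error (the socket source, batch interval, and object name are made-up details for illustration): the closure given to foreachPartition runs on the executors, so any SparkContext it captures has to be serialized with the task, which fails.

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object Repro {
      def main(args: Array[String]): Unit = {
        val sc  = SparkContext.getOrCreate()
        val ssc = new StreamingContext(sc, Seconds(5))
        val stream = ssc.socketTextStream("localhost", 9999)

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // `sc` is captured from the driver into this executor-side closure;
            // SparkContext is not serializable, so Spark throws
            // "Task not serializable" when the stage is submitted
            val lookup = sc.parallelize(records.toSeq)
            lookup.count()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }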

2 Answers
  • 2020-12-09 07:32

    I found out that fetching the existing SparkContext via SparkContext.getOrCreate() inside the loop works, whereas referencing a sparkContext sc created beforehand does not, i.e.

    // this works
    stream.foreachRDD( _ => {
        // update rdd
        .... = SparkContext.getOrCreate().parallelize(...)
    })
    
    // this doesn't work - throws a SparkContext not serializable error
    stream.foreachRDD( _ => {
        // update rdd
        .... = sc.parallelize(...)
    })
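
    The likely reason for the difference (my reading, not stated in the original answer): Spark Streaming serializes the closure handed to foreachRDD as part of the DStream graph, notably when checkpointing is enabled, so a captured sc is dragged into that serialization and fails. SparkContext.getOrCreate() defers the lookup until the closure actually executes on the driver, so no context object is ever captured.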
    
  • 2020-12-09 07:45

    It is not possible. SparkContext, SQLContext, and SparkSession can only be used on the driver. You can, however, use sqlContext at the top level of foreachRDD, which executes on the driver:

     myDStream.foreachRDD(rdd => {
         val df = sqlContext.createDataFrame(rdd, schema)
         ... 
     })
    

    You cannot use it inside a transformation or action, whose closures run on the executors:

    myDStream.foreachRDD(rdd => {
         rdd.foreach { record =>
            // runs on an executor - sqlContext cannot be used here
            val df = sqlContext.createDataFrame(...)
            ...
         }
     })
    

    You probably want the equivalent of:

    myDStream.foreachRDD(rdd => {
       val foo = rdd.mapPartitions(iter => doSomethingWithRedisClient(iter))
       val df = sqlContext.createDataFrame(foo, schema)
       df.write.parquet("s3n://bucket/pathToSentMessages")
    })
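
    For completeness, a minimal sketch of what a doSomethingWithRedisClient helper could look like (the Jedis client, host, port, key scheme, and String element type are assumptions for illustration, not from the answer): open one connection per partition, reuse it for every element, and close it before returning.

    import redis.clients.jedis.Jedis

    // hypothetical per-partition helper: nothing non-serializable is captured
    // from the driver; the client lives and dies inside the partition
    def doSomethingWithRedisClient(iter: Iterator[String]): Iterator[String] = {
      val client = new Jedis("redis-host", 6379) // assumed host and port
      try {
        // materialize while the connection is still open; a lazy iterator
        // would otherwise be consumed after client.close()
        iter.map { msg =>
          client.set(s"sent:$msg", msg) // illustrative key scheme
          msg
        }.toList.iterator
      } finally {
        client.close()
      }
    }

    One connection per partition, rather than one per record, amortizes the connection cost, which is the usual reason to reach for mapPartitions / foreachPartition in the first place.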
    