I want to use SparkContext and SQLContext inside foreachPartition, but I am unable to because of a serialization error. I know that neither object is serializable.
I found that obtaining the context through SparkContext.getOrCreate() inside the loop works, whereas referencing an existing SparkContext directly (assume I have created a SparkContext sc beforehand) does not, i.e.
// this works
stream.foreachRDD( _ => {
  // update rdd
  .... = SparkContext.getOrCreate().parallelize(...)
})
// this doesn't work - throws a SparkContext not serializable error
stream.foreachRDD( _ => {
  // update rdd
  .... = sc.parallelize(...)
})
It is not possible. SparkContext, SQLContext and SparkSession can be used only on the driver. You can use sqlContext at the top level of foreachRDD:
myDStream.foreachRDD(rdd => {
  val df = sqlContext.createDataFrame(rdd, schema)
  ...
})
You cannot use it inside a transformation or an action, because that code runs on the executors:
myDStream.foreachRDD(rdd => {
  rdd.foreach { record =>
    // executed on the executors, where sqlContext is not available
    val df = sqlContext.createDataFrame(...)
    ...
  }
})
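To illustrate why capturing the context fails while a lookup inside the closure does not, here is a minimal sketch in plain Scala with no Spark dependency. DriverOnlyContext is a hypothetical stand-in for SparkContext, and the sketch assumes Scala 2.12 or later, where function literals are serializable as long as everything they capture is.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-only object such as SparkContext.
// It deliberately does NOT extend Serializable.
class DriverOnlyContext {
  def parallelize(xs: Seq[Int]): Seq[Int] = xs
}

object DriverOnlyContext {
  private lazy val instance = new DriverOnlyContext
  // Same shape as SparkContext.getOrCreate(): a static lookup, nothing is captured.
  def getOrCreate(): DriverOnlyContext = instance
}

object ClosureCaptureDemo {
  // Returns true if Java serialization accepts the object.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The closure captures the local reference `sc`, so serializing the
  // closure also tries to serialize the context.
  def capturing(): Int => Seq[Int] = {
    val sc = new DriverOnlyContext
    (n: Int) => sc.parallelize(Seq(n))
  }

  // The closure captures nothing; the context is looked up when it runs.
  def lookingUp(): Int => Seq[Int] =
    (n: Int) => DriverOnlyContext.getOrCreate().parallelize(Seq(n))

  def main(args: Array[String]): Unit = {
    println(s"capturing closure serializable: ${isSerializable(capturing())}")
    println(s"lookup closure serializable:    ${isSerializable(lookingUp())}")
  }
}
```

The second closure passing the serialization check only means nothing non-serializable was captured. With the real SparkContext, the lookup is still only meaningful where the closure body runs on the driver, as it does at the top level of foreachRDD.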
You probably want the equivalent of:
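myDStream.foreachRDD(rdd => {
  // per-partition work runs on the executors and must not touch sqlContext
  val foo = rdd.mapPartitions(iter => doSomethingWithRedisClient(iter))
  // back on the driver: build and write the DataFrame
  val df = sqlContext.createDataFrame(foo, schema)
  df.write.parquet("s3n://bucket/pathToSentMessages")
})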
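For the function passed to mapPartitions, a rough sketch of the shape could look like the following. FakeClient and doSomethingWithClient are hypothetical names standing in for a real Redis client and for doSomethingWithRedisClient; the only thing taken from Spark is that mapPartitions expects a function from Iterator to Iterator.

```scala
// Hypothetical client standing in for a Redis connection; not a real library API.
final class FakeClient {
  def lookup(key: String): String = key.toUpperCase
  def close(): Unit = ()
}

object PartitionClientSketch {
  // Iterator => Iterator, the shape mapPartitions expects.
  def doSomethingWithClient(iter: Iterator[String]): Iterator[String] = {
    // Created inside the function, so it is built where the code runs
    // and never has to be serialized; one client per partition.
    val client = new FakeClient
    // Materialize the results before closing, because Iterator.map is lazy.
    val results = iter.map(client.lookup).toList
    client.close()
    results.iterator
  }
}
```

Materializing with toList keeps the whole partition in memory, which is the price paid here for closing the client deterministically. For large partitions a lazy iterator that closes the client once it is exhausted would be an alternative.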