Question
I have a requirement where I use a DStream to retrieve messages from Kafka. After receiving a message (an RDD), I use a map operation to process the messages independently on the executors. The challenge I am facing is that I need to read/write a Hive table from within the executors, and for that I need access to a SQLContext. But as far as I know, SparkSession is available only on the driver side and should not be used within the executors, and without a SparkSession (in Spark 2.1.1) I cannot get hold of a SQLContext. To summarize, my driver code looks something like this:
if (inputDStream_obj.isSuccess) {
  val inputDStream = inputDStream_obj.get
  inputDStream.foreachRDD(rdd => {
    if (!rdd.isEmpty) {
      // processMessage runs on the executors, one invocation per Kafka record
      val rdd1 = rdd.map(idocMessage => SegmentLoader.processMessage(props, idocMessage.value(), true))
    }
  })
}
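For context, inputDStream_obj above is a Try wrapped around the Kafka direct stream. A minimal sketch of how it might be created with the Spark 2.1 Kafka 0.10 integration follows; the broker address, group id, topic name, and the StreamingContext ssc are placeholders, not values from the original post:

import scala.util.Try
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",            // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "idoc-consumer",                    // placeholder
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Yields an InputDStream[ConsumerRecord[String, String]], which is why the
// map above reads the payload with idocMessage.value()
val inputDStream_obj = Try(
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("idoc-topic"), kafkaParams)
  )
)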
After this rdd.map, the next piece of code executes on the executors, and there I have something like:
// Inside the executor-side processing, spark being the SparkSession:
val sqlContext = spark.sqlContext
import sqlContext.implicits._
spark.sql("USE " + databaseName)
val result = Try(df.write.insertInto(tableName))  // df is built from the message
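To show where this executor-side code lives, here is a minimal sketch; the structure of SegmentLoader.processMessage is my reconstruction from the snippets above, and the session lookup is an assumption:

import java.util.Properties
import org.apache.spark.sql.SparkSession

object SegmentLoader {
  def processMessage(props: Properties, payload: String, flag: Boolean): Unit = {
    // This body runs on an executor, not the driver. Calling getOrCreate() here
    // tries to build a brand-new SparkContext with no master configured, which
    // is what raises "A master URL must be set in your configuration"; a session
    // captured or broadcast from the driver instead deserializes with null
    // transient internals, producing the NullPointerExceptions reported below.
    val spark = SparkSession.builder.getOrCreate() // assumption: how the session is obtained
    val sqlContext = spark.sqlContext
    // ... parse payload into a DataFrame and insert it into the Hive table,
    // as in the snippet above
  }
}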
Passing a SparkSession or SQLContext to the executors gives an error no matter how it is done:
When I try to obtain the existing SparkSession on the executor:
org.apache.spark.SparkException: A master URL must be set in your configuration
When I broadcast the session variable:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
When I pass the SparkSession object directly:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 9, <server>, executor 2): java.lang.NullPointerException
Let me know if you can suggest how to query/update a Hive table from within the executors.
Thanks, Ritwick
Source: https://stackoverflow.com/questions/44495803/how-to-read-write-a-hive-table-from-within-the-spark-executors