Question
I'm trying to implement a Streaming job that uses a custom receiver to read messages from SQS. Each message contains a single reference to an S3 file which I would then like to read, parse, and store as ORC.
Here is the code I have so far:
val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(5))

val sqs = streamContext.receiverStream(new SQSReceiver("events-elb")
  .credentials("accessKey", "secretKey")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

val s3File = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  val rawLogs = sc.textFile(s3Key)
  rawLogs
}).saveAsTextFiles("/tmp/output")
Unfortunately, this fails with the following error:
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@52fc5eb1)
- field (class: SparrowOrc$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class SparrowOrc$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
Is this an incorrect way to use sc.textFile? If so, what method might I use to forward each file path I receive from SQS to a file reader for processing?
FWIW, val s3File ends up being of type MappedDStream.
For further context, I'm using this as my receiver: https://github.com/imapi/spark-sqs-receiver.
Answer 1:
Indeed, we cannot use the SparkContext in a map operation: the closure is turned into tasks for a stage and runs on the executors, where no SparkContext is defined.
The way to approach this is to split the process in two: first, compute the S3 keys using the existing map, then read the files with textFile inside a transform operation:
val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
})

val filesDStream = s3Keys.transform { keys =>
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => sc.textFile(f))
  sc.union(files.toSeq)
}

filesDStream.saveAsTextFiles("/tmp/output") // same output prefix as in the question
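The original goal was to parse the logs and store them as ORC rather than as plain text files. Here is a minimal sketch of how that could continue from here, assuming a SparkSession named spark plus a hypothetical LogRecord case class and parseLog function for your log format (none of these are part of the original answer):

import org.apache.spark.sql.SparkSession

// Hypothetical record type and parser; replace with whatever your log format needs.
case class LogRecord(field1: String, field2: String)
def parseLog(line: String): LogRecord = {
  val parts = line.split("\t", 2)
  LogRecord(parts(0), if (parts.length > 1) parts(1) else "")
}

val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
import spark.implicits._

filesDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // The foreachRDD function runs on the driver once per batch,
    // so creating a DataFrame here is safe.
    rdd.map(parseLog).toDF()
      .write
      .mode("append")
      .orc("/tmp/output-orc") // hypothetical output path
  }
}

Note that each batch appends its own set of ORC files, so some downstream compaction of small files may be needed.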
Answer 2:
No, it's not correct, since SparkContext is:
- not serializable (as you can see in the logs)
- not something it would make sense to use on executors anyway
I'm thankful to the Spark devs for taking care of this check so we don't forget about it.
The reason for not allowing such use is that SparkContext lives on the driver (one could even say it constitutes the driver) and is responsible for orchestrating tasks for Spark jobs.
Executors are dumb and, as such, only know how to run tasks.
Spark simply does not work that way, and the sooner you accept this design decision, the more proficient you will become at developing Spark applications properly.
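To make the boundary concrete, here is a small sketch (not from the original answer) that reuses sc and the sqs DStream from the question and marks which closures run where; the path below is purely illustrative:

// Driver side: the function given to transform runs on the driver once per batch,
// so referencing sc here is fine.
val perBatchFiles = sqs.transform { rdd =>
  sc.textFile("/some/illustrative/path") // OK: evaluated on the driver
}

// Executor side: the function given to map is serialized and applied per record
// on the executors, so it must not capture sc.
val upper = sqs.map { message =>
  // sc.textFile(message)  // NOT OK: would require serializing SparkContext
  message.toUpperCase      // only plain, serializable logic belongs here
}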
If so, what method might I use to forward each filepath I receive from SQS to a file reader for processing?
That I cannot answer, as I've never developed a custom receiver.
Source: https://stackoverflow.com/questions/44769780/can-sparkcontext-textfile-be-used-with-a-custom-receiver