Can SparkContext.textFile be used with a custom receiver?

Question


I'm trying to implement a Streaming job that uses a custom receiver to read messages from SQS. Each message contains a single reference to an S3 file which I would then like to read, parse, and store as ORC.

Here is the code I have so far:

val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(5))

val sqs = streamContext.receiverStream(new SQSReceiver("events-elb")
  .credentials("accessKey", "secretKey")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

val s3File = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  val rawLogs = sc.textFile(s3Key)

  rawLogs
}).saveAsTextFiles("/tmp/output")

Unfortunately, this fails with the following error:

Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@52fc5eb1)
    - field (class: SparrowOrc$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
    - object (class SparrowOrc$$anonfun$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)

Is this an incorrect way to use sc.textFile? If so, what method might I use to forward each filepath I receive from SQS to a file reader for processing?

FWIW, val s3File ends up being of type MappedDStream.

For further context, I'm using this as my receiver: https://github.com/imapi/spark-sqs-receiver.


Answer 1:


Indeed, we cannot use the SparkContext in a map operation: the closure is serialized and shipped to the executors as part of a stage, and no SparkContext is defined there.

The way to approach this is to split the process in two: first, we compute the S3 keys with the existing map, and then we make use of textFile in a transform operation, which runs on the driver:

val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  s3Key
})

val filesDStream = s3Keys.transform { keys =>
  // transform runs on the driver, so the SparkContext is available here
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => sc.textFile(f))
  sc.union(files)
}

filesDStream.saveAsTextFiles("/tmp/output")
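
This works because transform executes its body on the driver once per batch interval: collecting the keys is cheap (each batch yields only a small list of S3 paths), while the actual file reading still happens on the executors when the resulting RDDs are evaluated.

The question's end goal is ORC rather than text output. Here is a minimal sketch of that last step, assuming Spark 2.3+ (or Hive support enabled on earlier 2.x, which the ORC writer requires there) and a hypothetical output path, writing the raw lines unparsed for brevity:

import org.apache.spark.sql.SparkSession

filesDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // foreachRDD also runs on the driver, so obtaining a SparkSession here is safe
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    // write this micro-batch as ORC; parsing into a proper schema would go before toDF
    rdd.toDF("line").write.mode("append").orc("/tmp/output-orc")
  }
}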



Answer 2:


No. It's not correct, since SparkContext:

  1. is not serializable (as you can see in the logs)
  2. would not make sense on the executors anyway

I'm thankful to the Spark devs that they took care of this, so we can't forget about it.

The reason for not allowing such use is that SparkContext lives on the driver (or one could say constitutes the driver) and is responsible for orchestrating tasks (for Spark jobs).

Executors are dumb and as such know only how to run tasks.
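
A minimal illustration of that split, reusing the sqs stream from the question (the prefix constant is made up for the example): plain serializable values may be captured by a closure and shipped to the executors, but the SparkContext may not.

val prefix = "s3://"                           // a plain serializable value: safe to capture
val keys = sqs.map(msg => prefix + msg)        // closure runs on executors; only prefix is shipped
// val bad = sqs.map(msg => sc.textFile(msg))  // fails: the closure would capture the SparkContext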

Spark does not work like this, and the sooner you accept that design decision, the more proficient you will become at developing Spark applications properly.

If so, what method might I use to forward each filepath I receive from SQS to a file reader for processing?

That I cannot answer, as I've never developed a custom receiver.



Source: https://stackoverflow.com/questions/44769780/can-sparkcontext-textfile-be-used-with-a-custom-receiver
