Question
I have a Spark Streaming job that reads data from a Kafka cluster using the direct approach. There is a cyclical spike in processing times that I cannot explain and that is not reflected in the Spark UI metrics. The following image shows this pattern (batch time = 10s):
This issue is reproducible every time the job is run. There is no data in the Kafka logs to read, so there is no real processing of note to perform. I would expect the line to be flat, near the minimum time needed to serialize and send the tasks to the executors.
The pattern is: one job takes 9 seconds (including 5 seconds of scheduler delay), the next job takes 5 seconds (with no scheduler delay), and the next two jobs take roughly 0.8 and 0.2 seconds.
According to the Spark UI, the 9- and 5-second jobs don't appear to do any more work (apart from the scheduler delay).
Here is the task time summary for the 5-second job:
None of the executors are taking anywhere near 5 seconds to complete their tasks.
Has anyone else experienced this or do you have any suggestions what may be causing this?
Here is a stripped-down version of the main streaming code:
def main(args: Array[String]): Unit = {
  val (runtimeConfig: RuntimeConfig, cassandraConfig: CassandraConfig.type, kafkaConfig: KafkaConfig.type,
    streamingContext: StreamingContext) = loadConfig(args)

  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> kafkaConfig.metadataBrokerList,
    "fetch.message.max.bytes" -> kafkaConfig.fetchMessageMaxBytes)

  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    streamingContext, kafkaParams, Set(runtimeConfig.kafkaTopic))

  val uuidGenerator = streamingContext.sparkContext.broadcast(
    Generators.timeBasedGenerator(EthernetAddress.fromInterface()))

  runtimeConfig.kafkaTopic match {
    case Topics.edges => saveEdges(runtimeConfig, messages, uuidGenerator)
    case Topics.messages =>
      val formatter = streamingContext.sparkContext.broadcast(DateTimeFormat.forPattern(AppConfig.dateFormat))
      saveMessages(cassandraConfig, runtimeConfig, messages, formatter)
  }

  streamingContext.start()
  streamingContext.awaitTermination()
}
def saveEdges(runtimeConfig: RuntimeConfig, kafkaStream: DStream[(String, String)],
              uuidGenerator: Broadcast[TimeBasedGenerator]): Unit = {
  val edgesMessages = kafkaStream.flatMap(msg => {
    implicit val formats = DefaultFormats
    parse(msg._2).extract[List[EdgeMessage]].flatMap(em => (List.fill(em.ids.size)(em.userId) zip em.ids))
  }).map(edge => Edge(edge._1, edge._2)).saveAsTextFiles("tester", ".txt")
}
Spark settings:
val conf = new SparkConf()
  .set("spark.mesos.executor.home", AppConfig.sparkHome)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.streaming.kafka.maxRatePerPartition", "1")
  .set("spark.streaming.blockInterval", "500")
  .set("spark.cores.max", "36")
Relevant build.sbt extract:
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.1",
"org.apache.spark" %% "spark-core" % "1.5.1",
"org.apache.spark" %% "spark-streaming" % "1.5.1",
"org.apache.spark" %% "spark-graphx" % "1.5.1",
- Kafka version: 2.10-0.8.2.1
- Resource manager: Mesos 0.23
- Cluster Details: 6 Spark Workers, 6 Kafka Brokers, 5 node Zookeeper Ensemble (on same machines). 12 Kafka partitions.
Note: sparktmp and kafka-logs are generally located on the same spinning disks on each node.
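To quantify the per-batch scheduling and processing delay outside of the Spark UI, I could also register a StreamingListener; a minimal sketch (not in the job yet) against the Spark 1.5 streaming API:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs scheduling delay and processing time for every completed batch.
class BatchDelayLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch=${info.batchTime} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingDelayMs=${info.processingDelay.getOrElse(-1L)}")
  }
}

// Registered before streamingContext.start():
// streamingContext.addStreamingListener(new BatchDelayLogger)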
Answer 1:
The problem seems to be with the Mesos scheduler. I'm not sure exactly why it starts slowing down jobs like this. However, I restarted the Mesos cluster and now the saw-tooth processing times are gone.
As you can see here, the processing times are now much more stable:
Source: https://stackoverflow.com/questions/34002565/spark-streaming-kafka-direct-stream-processing-time-performance-spikes