Spark Streaming Kafka direct stream processing time performance spikes

Submitted by 两盒软妹~` on 2019-12-13 15:51:27

Question


I have a Spark Streaming job that reads data from a Kafka cluster using the direct approach. There is a cyclical spike in processing times that I cannot explain and that is not reflected in the Spark UI metrics. The following image shows this pattern (batch time = 10s):

This issue is reproducible every time the job is run. There is no data in the Kafka logs to be read, so there is no real processing to perform. I would expect the line to be flat, near the minimum time needed to serialize and send the tasks to the executors.

The pattern is: one job takes 9 seconds (5 of which are scheduler delay), the next takes 5 seconds (with no scheduler delay), and the next two take roughly 0.8 and 0.2 seconds.

According to the Spark UI, the 9-second and 5-second jobs don't appear to do any more work than the fast ones (apart from the scheduler delay).

Here is the task time summary for the 5-second job:

None of the executors are taking anywhere near 5 seconds to complete their tasks.

Has anyone else experienced this, or does anyone have suggestions about what may be causing it?

Here is a stripped down version of the main streaming code:

def main(args: Array[String]): Unit = {
  val (runtimeConfig: RuntimeConfig, cassandraConfig: CassandraConfig.type, kafkaConfig: KafkaConfig.type,
       streamingContext: StreamingContext) = loadConfig(args)

  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> kafkaConfig.metadataBrokerList,
    "fetch.message.max.bytes" -> kafkaConfig.fetchMessageMaxBytes)

  // Direct (receiver-less) Kafka stream
  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    streamingContext, kafkaParams, Set(runtimeConfig.kafkaTopic))
  val uuidGenerator = streamingContext.sparkContext.broadcast(
    Generators.timeBasedGenerator(EthernetAddress.fromInterface()))

  runtimeConfig.kafkaTopic match {
    case Topics.edges => saveEdges(runtimeConfig, messages, uuidGenerator)
    case Topics.messages =>
      val formatter = streamingContext.sparkContext.broadcast(DateTimeFormat.forPattern(AppConfig.dateFormat))
      saveMessages(cassandraConfig, runtimeConfig, messages, formatter)
  }

  streamingContext.start()
  streamingContext.awaitTermination()
}

def saveEdges(runtimeConfig: RuntimeConfig, kafkaStream: DStream[(String, String)],
              uuidGenerator: Broadcast[TimeBasedGenerator]): Unit = {
  kafkaStream.flatMap { msg =>
    implicit val formats = DefaultFormats
    // Expand each EdgeMessage into (userId, id) pairs
    parse(msg._2).extract[List[EdgeMessage]].flatMap(em => List.fill(em.ids.size)(em.userId) zip em.ids)
  }.map(edge => Edge(edge._1, edge._2)).saveAsTextFiles("tester", ".txt")
}
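
One way to see where the extra seconds go (since the per-task times in the Spark UI don't account for them) is to attach a StreamingListener and log the scheduling delay versus processing delay that Spark reports for each batch. This is a minimal sketch against the Spark 1.5 streaming listener API; the listener class name and log format are my own, not part of the original job:

class BatchTimingListener extends org.apache.spark.streaming.scheduler.StreamingListener {
  import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted

  // Log per-batch delays so scheduler delay can be separated from real processing time.
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch=${info.batchTime} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingDelayMs=${info.processingDelay.getOrElse(-1L)} " +
      s"totalDelayMs=${info.totalDelay.getOrElse(-1L)} " +
      s"records=${info.numRecords}")
  }
}

// Register before streamingContext.start():
// streamingContext.addStreamingListener(new BatchTimingListener)

If the processing delay itself is small while the total batch time is large, that points at scheduling/dispatch overhead (e.g. the resource manager) rather than the executors.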

Spark settings:

val conf = new SparkConf()
  .set("spark.mesos.executor.home", AppConfig.sparkHome)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.streaming.kafka.maxRatePerPartition", "1")
  .set("spark.streaming.blockInterval", "500")
  .set("spark.cores.max", "36")

Relevant build.sbt extract:

"org.apache.spark" % "spark-streaming-kafka_2.10"  % "1.5.1",
"org.apache.spark" %% "spark-core" % "1.5.1",
"org.apache.spark" %% "spark-streaming" % "1.5.1",
"org.apache.spark" %% "spark-graphx" % "1.5.1",
  • Kafka version: 0.8.2.1 (kafka_2.10 build)
  • Resource manager: Mesos 0.23
  • Cluster details: 6 Spark workers, 6 Kafka brokers, and a 5-node ZooKeeper ensemble (on the same machines). 12 Kafka partitions.

Note: sparktmp and kafka-logs are generally located on the same spinning disks on each node.


Answer 1:


The problem seems to be with the Mesos scheduler. I'm not sure exactly why it starts slowing down jobs like this. However, after restarting the Mesos cluster the saw-tooth processing times are gone.

As you can see here, the processing times are now much more stable:



Source: https://stackoverflow.com/questions/34002565/spark-streaming-kafka-direct-stream-processing-time-performance-spikes
