I have implemented Spark Streaming using createDirectStream. My Kafka producer is sending several messages every second to a topic with two partitions.
On the Spark Streaming side, I read Kafka messages every second and then window them with a 5-second window size and slide interval.
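For context, here is a minimal sketch of that setup using the Spark 1.5 / Kafka 0.8 direct-stream API; the broker address and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaDirectWindow")
// 1-second batch interval: Kafka is read once per second
val ssc = new StreamingContext(conf, Seconds(1))

// Hypothetical broker and topic; the real topic has two partitions
val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// 5-second window, sliding every 5 seconds
val windowed = stream.map(_._2).window(Seconds(5), Seconds(5))
windowed.count().print()

ssc.start()
ssc.awaitTermination()
```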
The Kafka messages are processed properly; I'm seeing the right computations and prints.
But in the Spark Web UI, under the Streaming section, the number of events per window is shown as zero.
I'm puzzled why it is showing zero. Shouldn't it show the number of Kafka messages being fed into the Spark stream?
Update:
This issue seems to happen when I use the groupByKeyAndWindow() API. When I commented out this API usage in my code, the Spark Streaming UI started reporting the Kafka event input size correctly.
Any idea why this is so? Could this be a defect in Spark Streaming?
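For reference, a hedged sketch of roughly what that usage looks like; the keying of the messages is hypothetical, and `lines` stands for the value DStream obtained from the direct stream above:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// `lines` is the message-value DStream from the direct stream above;
// keying by the message itself is just for illustration
def windowedGroups(lines: DStream[String]): DStream[(String, Iterable[Int])] =
  lines.map(line => (line, 1))
       .groupByKeyAndWindow(Seconds(5), Seconds(5))

// With this stage in the pipeline, the Streaming tab reported zero input
// events; removing it made the input size show up again
```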
I'm using Cloudera CDH: 5.5.1, Spark: 1.5.0, Kafka: KAFKA-0.8.2.0-1.kafka1.4.0.p0.56
It seems that this metric is simply not recorded by the Spark Kafka library code.
Based on the code of Spark 2.3.1:

- Search for "Input Size / Records": it is the value of `stageData.inputBytes` (StagePage.scala).
- Search for `StageData` and `inputBytes`: it is the value of `metrics.inputMetrics.bytesRead` (LiveEntity.scala).
- Search for `bytesRead`: it is set in HadoopRDD.scala, FileScanRDD.scala and ShuffleSuite.scala, but not in any Kafka-related files.
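Given that `bytesRead` is never set by the Kafka RDD, one workaround (a sketch, not part of the original answer) is to count records yourself from the offset ranges the direct stream attaches to each batch's RDD. Note this must be done on the direct stream's own RDDs, before `window()` or any shuffle breaks the RDD-to-offset-range mapping; `stream` is the direct stream from the earlier sketch:

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Only the KafkaRDDs produced directly by createDirectStream carry
// offset ranges, so do this before any windowing or shuffling
stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val recordCount = ranges.map(r => r.untilOffset - r.fromOffset).sum
  println(s"Kafka records in this batch: $recordCount")
}
```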
Source: https://stackoverflow.com/questions/37070118/spark-streaming-kafka-createdirectstream-spark-ui-shows-input-event-size-as-ze