Spark Structured Streaming with Kafka leads to only one batch (PySpark)

Question

I have the following code and I'm wondering why it generates only one batch:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "IP").option("subscribe", "Topic").option("startingOffsets","earliest").load()
# groupBy on sliding windows (omitted here), producing slidingWindowsDF
query = slidingWindowsDF.writeStream.queryName("bla").outputMode("complete").format("memory").start()

The application is launched with the following parameters:

spark.streaming.backpressure.initialRate 5
spark.streaming.backpressure.enabled True

The Kafka topic contains around 11 million messages. I expected at least two batches because of the initialRate parameter, but only one is generated. Can anyone tell me why Spark processes everything in a single batch?

I'm using Spark 2.2.1 and Kafka 1.0.

Answer 1:

That is because the spark.streaming.backpressure.initialRate parameter is used only by the old Spark Streaming (DStream) API, not by Structured Streaming.

Instead, use the maxOffsetsPerTrigger option of the Kafka source: http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
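
As a rough illustration of that suggestion, here is a minimal sketch of how the option could be passed to the Kafka source from the question. The helper names (kafka_source_options, approx_batches, read_kafka_stream) and the limit of 100000 offsets per trigger are assumptions made for this example, not part of the original post; "IP" and "Topic" are the placeholders used in the question.

```python
def kafka_source_options(bootstrap_servers, topic, max_offsets_per_trigger):
    """Build the Kafka source options, including the per-trigger rate limit.

    Hypothetical helper for illustration; Spark reads option values as strings.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
        # Caps the total number of offsets read per micro-batch.
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }


def approx_batches(total_messages, max_offsets_per_trigger):
    """Rough lower bound on micro-batches needed to drain a backlog (ceiling division)."""
    return -(-total_messages // max_offsets_per_trigger)


def read_kafka_stream(spark, options):
    """Create the streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()


# The limit below is an arbitrary example value, not a recommendation.
options = kafka_source_options("IP", "Topic", 100000)
print(approx_batches(11_000_000, 100000))
```

With a running SparkSession and the Kafka connector on the classpath, df = read_kafka_stream(spark, options) would stand in for the original readStream line. With this example limit, a backlog of around 11 million messages would need on the order of 110 micro-batches instead of one.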

By the way, see also this answer: "How Spark Structured Streaming handles backpressure?". Structured Streaming does not currently have full backpressure support.

Source: https://stackoverflow.com/questions/50527893/spark-structured-streaming-with-kafka-leads-to-only-one-batch-pyspark

Tags

apache-spark

pyspark

apache-kafka
