Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records.
In addition to the answers above: the batch size is the product of three parameters.
batchDuration
: the time interval, in seconds, at which streaming data is divided into batches.
spark.streaming.kafka.maxRatePerPartition
: the maximum number of messages per partition per second. Combined with batchDuration, this controls the batch size.
Number of partitions in the Kafka topic
: the per-partition rate applies to each partition, so the total scales with the partition count.
You want maxRatePerPartition to be set, and large (otherwise you are effectively throttling your job), and batchDuration to be very small.
For a better explanation of how this product works with backpressure enabled or disabled, see "set spark.streaming.kafka.maxRatePerPartition for createDirectStream".
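To make the product concrete, here is a small arithmetic sketch. It assumes the third factor is the number of Kafka partitions in the topic, and the numbers (5-second batches, 1000 messages per second per partition, 12 partitions) are made up for illustration. The function name is hypothetical and not part of Spark.

```python
def max_records_per_batch(batch_duration_s: float,
                          max_rate_per_partition: int,
                          num_partitions: int) -> int:
    """Upper bound on the number of records in one batch.

    batch_duration_s:        the streaming batchDuration, in seconds
    max_rate_per_partition:  spark.streaming.kafka.maxRatePerPartition
    num_partitions:          partitions in the Kafka topic (assumed third factor)
    """
    return int(batch_duration_s * max_rate_per_partition * num_partitions)


# Hypothetical numbers: 5 s batches, 1000 msg/s per partition, 12 partitions.
print(max_records_per_batch(5, 1000, 12))  # 60000
```

With these example values no batch, including the first one, should exceed 60000 records, whatever the backlog in Kafka.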
Limiting the maximum batch size helps greatly to control the processing time; however, it increases the processing latency of messages.
The following properties control the batch size:
spark.streaming.receiver.maxRate=
spark.streaming.kafka.maxRatePerPartition=
You can even set the batch size dynamically, based on processing time, by enabling backpressure:
spark.streaming.backpressure.enabled:true
spark.streaming.backpressure.initialRate:
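One way to pass these properties is as --conf arguments to spark-submit. The sketch below only builds that command line. The helper name to_submit_args, the script name your_job.py, and all numeric values are placeholders chosen for illustration, not recommendations.

```python
def to_submit_args(settings: dict) -> list:
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; tune them against your own throughput.
rate_limits = {
    "spark.streaming.backpressure.enabled": "true",
    "spark.streaming.backpressure.initialRate": 1000,
    "spark.streaming.kafka.maxRatePerPartition": 2000,
}

# your_job.py is a placeholder for the actual application.
print(" ".join(["spark-submit"] + to_submit_args(rate_limits) + ["your_job.py"]))
```

The same key/value pairs can also be set on the SparkConf inside the application before the streaming context is created.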
I think your problem can be solved by Spark Streaming backpressure.
Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.
By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.
From the Apache Spark Kafka configuration:
spark.streaming.backpressure.enabled
:
This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
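The upper bound described in the quoted documentation can be pictured with a simplified model. This is only a sketch of the "estimate capped by the configured maximum" idea, not Spark's internal implementation, and the function name effective_rate is hypothetical.

```python
from typing import Optional


def effective_rate(estimated_rate: float, max_rate: Optional[float]) -> float:
    """Cap the backpressure estimate by the configured maximum, if one is set.

    estimated_rate: rate (records/s) suggested by the backpressure mechanism
    max_rate:       configured maximum rate, or None when the property is not set
    """
    if max_rate is None:
        return estimated_rate
    return min(estimated_rate, max_rate)


# Hypothetical numbers: backpressure would allow 100000 records/s,
# but the configured maximum is 10000 records/s.
print(effective_rate(100000, 10000))  # 10000
print(effective_rate(5000, None))     # 5000
```

In this model the configured maximum never raises the rate; it only limits what backpressure would otherwise allow.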
And since you want to control the first batch, or more specifically the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.
spark.streaming.backpressure.initialRate
:
This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.
This is useful when your Spark job (or your Spark workers as a whole) can process, say, 10000 messages from Kafka, but the Kafka brokers hand your job 100000 messages.
You may also be interested in spark.streaming.kafka.maxRatePerPartition, and in the research and suggestions for these properties, based on a real example, that Jeroen van Wilgenburg published on his blog.