Spark Streaming: Kafka group id not permitted in Spark Structured Streaming

名媛妹妹 2021-01-19 04:22

I am writing a Spark structured streaming application in PySpark to read data from Kafka.

However, the current version of Spark is 2.1.0, which does not allow me to

2 answers

南笙 (original poster)

2021-01-19 04:50
Setting the consumer group id is possible as of Spark 3.x, through the source option kafka.group.id. See the Structured Streaming + Kafka Integration Guide, which says:

kafka.group.id: The Kafka group id to use in Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use a specific authorized group id to read data. You can optionally set the group id. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both, batch and streaming) or sources with the same group id are likely to interfere with each other causing each query to read only part of the data. This may also occur when queries are started/restarted in quick succession. To minimize such issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to be very small. When this is set, option "groupIdPrefix" will be ignored.
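
For illustration, here is a minimal PySpark sketch of passing that option to the Kafka source. It assumes Spark 3.x with the spark-sql-kafka connector available; the helper names kafka_source_options and read_kafka_stream are hypothetical, and the broker address, topic and group id are whatever you pass in.

```python
def kafka_source_options(bootstrap_servers, topic, group_id):
    """Build the option map for a Kafka source that uses a fixed group id."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only honored from Spark 3.0 on; when set, groupIdPrefix is ignored.
        "kafka.group.id": group_id,
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    # Keyword expansion is used because the option names contain dots.
    return spark.readStream.format("kafka").options(**options).load()
```

With an existing SparkSession named spark, a call such as read_kafka_stream(spark, kafka_source_options("localhost:9092", "my_topic", "my_authorized_group")) would return the streaming DataFrame. The three argument values here are placeholders.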

However, this group.id is still not used to commit offsets back to Kafka and the offset management remains within Spark's checkpoint files. I have given more details (also for Spark < 3.x) in my answers:
- How to manually set group.id and commit kafka offsets in spark structured streaming?
- How to use kafka.group.id in spark 3.0
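
Because offsets are tracked in Spark's checkpoint files and not committed to Kafka, it is the checkpoint location that determines where a restarted query resumes. A minimal sketch, assuming df is a streaming DataFrame read from Kafka; the helper name start_with_checkpoint and the console sink are illustrative choices, not part of the original answer:

```python
def start_with_checkpoint(df, checkpoint_dir):
    """Start a streaming query whose progress is stored under checkpoint_dir."""
    return (
        df.writeStream
        .format("console")  # illustrative sink; any supported sink works
        # Spark records the consumed Kafka offsets here, not in the
        # Kafka consumer group configured via kafka.group.id.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Restarting the query with the same checkpoint directory lets Spark resume from the recorded offsets, whichever group id the source uses.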