Question
I was going through the Spark Structured Streaming + Kafka integration guide here.
The guide states that
enable.auto.commit: Kafka source doesn’t commit any offset.
So how do I manually commit offsets once my Spark application has successfully processed each record?
Answer 1:
Current Situation (Spark 2.4.5)
This feature seems to be under discussion in the Spark community: https://github.com/apache/spark/pull/24613.
In that pull request you will also find a possible solution: https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer.
At the moment, the Spark Structured Streaming + Kafka integration documentation clearly states how it manages Kafka offsets. The most important Kafka specific configurations for offsets are:
- group.id: Kafka source will create a unique group id for each query automatically. According to the code the group.id will be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
- auto.offset.reset: Set the source option startingOffsets to specify where to start instead (a short sketch follows after this list). Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it.
- enable.auto.commit: Kafka source doesn’t commit any offset.
Therefore, in Structured Streaming it is currently not possible to define a custom group.id for the Kafka consumer: Structured Streaming manages the offsets internally and does not commit them back to Kafka (not even automatically).
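For illustration, here is a minimal sketch of how the source option startingOffsets is used in place of auto.offset.reset. The topic name and the per-partition values are just examples, and spark refers to the SparkSession created in the example further below:

// Start from the earliest available offsets; "latest" is the default for streaming queries.
val dfFromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("startingOffsets", "earliest")
  .load()

// Alternatively, pin exact per-partition starting offsets as JSON
// (-2 means earliest, -1 means latest):
// .option("startingOffsets", """{"testingKafkaProducer":{"0":5}}""")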
Current Situation in Action
Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:
import org.apache.spark.sql.SparkSession

// create SparkSession
val spark = SparkSession.builder()
  .appName("ListenerTester")
  .master("local[1]")
  .getOrCreate()

// read from Kafka topic
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("failOnDataLoss", "false")
  .load()

// write to Kafka topic and set checkpoint directory for this stream
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "testingKafkaProducerOut")
  .option("checkpointLocation", "/home/.../sparkCheckpoint/")
  .start()

// keep the application running until the query is stopped or fails
query.awaitTermination()
Offset Management by Spark
Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:
myCheckpointDir/offsets/
{"testingKafkaProducer":{"0":1}}
Here the entry in the checkpoint file confirms that the next offset to be consumed from partition 0 is 1. It implies that the application has already processed offset 0 from partition 0 of the topic named testingKafkaProducer.
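To check this without opening the files by hand, here is a small sketch. It assumes, as holds for Spark 2.4, that the file names under offsets/ are the micro-batch ids; myCheckpointDir again stands for your checkpointLocation:

import java.io.File
import scala.io.Source

// List the offset log files Spark wrote for this query and print the newest one.
val offsetFiles = new File("myCheckpointDir/offsets")
  .listFiles()
  .filter(f => f.isFile && f.getName.forall(_.isDigit))
  .sortBy(_.getName.toInt)  // file names are the micro-batch ids: 0, 1, 2, ...

val latest = Source.fromFile(offsetFiles.last)
try println(latest.mkString) finally latest.close()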
More on the fault-tolerance semantics is given in the Spark documentation.
Offset Management by Kafka
However, as stated in the documentation, the offset is not committed back to Kafka.
This can be checked by executing the kafka-consumer-groups.sh tool that ships with the Kafka installation.
./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"
TOPIC                 PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID       HOST        CLIENT-ID
testingKafkaProducer  0          -               1               -    consumer-1-[...]  /127.0.0.1  consumer-1
The current offset for this application is unknown to Kafka as it has never been committed.
Possible Solutions
What I have seen while doing some research on the web is that you can commit offsets in the onQueryProgress callback of a customized StreamingQueryListener in Spark.
As I will not claim to have developed this myself, here are the most important links that helped me to understand:
- Code Example for Listener
- Discussion on SO around offset management
- General description on the StreamingQueryListener
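Drawing on those sources, here is a minimal sketch of such a listener. It assumes json4s (which ships with Spark) for parsing the progress JSON and a plain KafkaConsumer for committing; the class name KafkaOffsetCommitListener, its constructor parameters, and the description-based filter for Kafka sources are illustrative assumptions, not an official Spark API:

import java.util.Properties
import java.util.{HashMap => JHashMap}

import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Sketch: commit the end offsets of every finished micro-batch back to Kafka
// under a group id of your choosing.
class KafkaOffsetCommitListener(bootstrapServers: String, groupId: String)
    extends StreamingQueryListener {

  private implicit val formats: Formats = DefaultFormats

  // Plain consumer used only for committing offsets; it never subscribes or polls.
  private val consumer: KafkaConsumer[Array[Byte], Array[Byte]] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Assumption: identify Kafka sources by their description string.
    event.progress.sources
      .filter(s => Option(s.description).exists(_.toLowerCase.contains("kafka")))
      .foreach { source =>
        // The Kafka source reports its end offsets as JSON, e.g.
        // {"testingKafkaProducer":{"0":1}} -- the same format as the
        // files in the checkpoint directory shown above.
        val endOffsets = parse(source.endOffset).extract[Map[String, Map[String, Long]]]
        val toCommit = new JHashMap[TopicPartition, OffsetAndMetadata]()
        for {
          (topic, partitions) <- endOffsets
          (partition, offset) <- partitions
        } toCommit.put(new TopicPartition(topic, partition.toInt), new OffsetAndMetadata(offset))
        if (!toCommit.isEmpty) consumer.commitSync(toCommit)
      }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = consumer.close()
}

Register it on the SparkSession before starting the query, e.g. spark.streams.addListener(new KafkaOffsetCommitListener("localhost:9092", "myExternalGroup")), where myExternalGroup is a group id of your choosing. With this in place, kafka-consumer-groups.sh --group myExternalGroup would show the committed offsets. Keep in mind that Spark still recovers from its checkpoint, not from these commits, so they are useful purely for external monitoring or tooling.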
Source: https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream