Question
I was going through the Spark Structured Streaming + Kafka integration guide here.
The guide states that
enable.auto.commit: Kafka source doesn’t commit any offset.
So how do I manually commit offsets once my Spark application has successfully processed each record?
Answer 1:
Current Situation (Spark 2.4.5)
This feature seems to be under discussion in the Spark community: https://github.com/apache/spark/pull/24613.
In that pull request you will also find a possible solution: https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer.
At the moment, the Spark Structured Streaming + Kafka integration documentation clearly states how it manages Kafka offsets. The most important Kafka specific configurations for offsets are:
- group.id: Kafka source will create a unique group id for each query automatically. According to the code the group.id will be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
- auto.offset.reset: Set the source option startingOffsets to specify where to start instead (a short sketch follows after this list). Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it.
- enable.auto.commit: Kafka source doesn’t commit any offset.
Therefore, in Structured Streaming it is currently not possible to define a custom group.id for the Kafka consumer: Structured Streaming manages the offsets internally and does not commit them back to Kafka (not even automatically).
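For illustration, here is a minimal sketch of how the source option startingOffsets is used in place of auto.offset.reset. The topic name and the per-partition values are just examples, and spark refers to the SparkSession created in the example further below:

// Start from the earliest available offsets; "latest" is the default for streaming queries.
val dfFromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("startingOffsets", "earliest")
  .load()

// Alternatively, pin exact per-partition starting offsets as JSON
// (-2 means earliest, -1 means latest):
// .option("startingOffsets", """{"testingKafkaProducer":{"0":5}}""")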
Current Situation in Action
Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:
import org.apache.spark.sql.SparkSession

// create SparkSession
val spark = SparkSession.builder()
  .appName("ListenerTester")
  .master("local[1]")
  .getOrCreate()

// read from Kafka topic
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("failOnDataLoss", "false")
  .load()

// write to Kafka topic and set checkpoint directory for this stream
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "testingKafkaProducerOut")
  .option("checkpointLocation", "/home/.../sparkCheckpoint/")
  .start()

// keep the application running until the query is stopped or fails
query.awaitTermination()
Offset Management by Spark
Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:
myCheckpointDir/offsets/
{"testingKafkaProducer":{"0":1}}
Here the entry in the checkpoint file confirms that the next offset to be consumed from partition 0 is 1. It implies that the application has already processed offset 0 from partition 0 of the topic named testingKafkaProducer.
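To check this without opening the files by hand, here is a small sketch. It assumes, as holds for Spark 2.4, that the file names under offsets/ are the micro-batch ids; myCheckpointDir again stands for your checkpointLocation:

import java.io.File
import scala.io.Source

// List the offset log files Spark wrote for this query and print the newest one.
val offsetFiles = new File("myCheckpointDir/offsets")
  .listFiles()
  .filter(f => f.isFile && f.getName.forall(_.isDigit))
  .sortBy(_.getName.toInt)  // file names are the micro-batch ids: 0, 1, 2, ...

val latest = Source.fromFile(offsetFiles.last)
try println(latest.mkString) finally latest.close()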
More on the fault-tolerance semantics is given in the Spark documentation.
Offset Management by Kafka
However, as stated in the documentation, the offset is not committed back to Kafka.
This can be checked by executing the kafka-consumer-groups.sh tool that ships with the Kafka installation.
./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"
TOPIC                 PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID       HOST        CLIENT-ID
testingKafkaProducer  0          -               1               -    consumer-1-[...]  /127.0.0.1  consumer-1
The current offset for this application is unknown to Kafka as it has never been committed.
Possible Solutions
What I have seen while doing some research on the web is that you can commit offsets in the onQueryProgress callback of a customized StreamingQueryListener in Spark.
As I will not claim to have developed this myself, here are the most important links that helped me to understand:
- Code Example for Listener
- Discussion on SO around offset management
- General description on the StreamingQueryListener
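Drawing on those sources, here is a minimal sketch of such a listener. It assumes json4s (which ships with Spark) for parsing the progress JSON and a plain KafkaConsumer for committing; the class name KafkaOffsetCommitListener, its constructor parameters, and the description-based filter for Kafka sources are illustrative assumptions, not an official Spark API:

import java.util.Properties
import java.util.{HashMap => JHashMap}

import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Sketch: commit the end offsets of every finished micro-batch back to Kafka
// under a group id of your choosing.
class KafkaOffsetCommitListener(bootstrapServers: String, groupId: String)
    extends StreamingQueryListener {

  private implicit val formats: Formats = DefaultFormats

  // Plain consumer used only for committing offsets; it never subscribes or polls.
  private val consumer: KafkaConsumer[Array[Byte], Array[Byte]] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Assumption: identify Kafka sources by their description string.
    event.progress.sources
      .filter(s => Option(s.description).exists(_.toLowerCase.contains("kafka")))
      .foreach { source =>
        // The Kafka source reports its end offsets as JSON, e.g.
        // {"testingKafkaProducer":{"0":1}} -- the same format as the
        // files in the checkpoint directory shown above.
        val endOffsets = parse(source.endOffset).extract[Map[String, Map[String, Long]]]
        val toCommit = new JHashMap[TopicPartition, OffsetAndMetadata]()
        for {
          (topic, partitions) <- endOffsets
          (partition, offset) <- partitions
        } toCommit.put(new TopicPartition(topic, partition.toInt), new OffsetAndMetadata(offset))
        if (!toCommit.isEmpty) consumer.commitSync(toCommit)
      }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = consumer.close()
}

Register it on the SparkSession before starting the query, e.g. spark.streams.addListener(new KafkaOffsetCommitListener("localhost:9092", "myExternalGroup")), where myExternalGroup is a group id of your choosing. With this in place, kafka-consumer-groups.sh --group myExternalGroup would show the committed offsets. Keep in mind that Spark still recovers from its checkpoint, not from these commits, so they are useful purely for external monitoring or tooling.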
Source: https://stackoverflow.com/questions/50844449/how-to-manually-set-group-id-and-commit-kafka-offsets-in-spark-structured-stream