Stream data using Spark from a particular partition within Kafka topics

Submitted on 2019-12-11 15:45:52

Question


I have already seen a similar question (click here).

But I still want to know: is streaming data from a particular partition really not possible? I have used the Kafka consumer strategies in Spark Streaming's subscribe method.

ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets)

This is the code snippet I tried for subscribing to the topic and partition:

val topics = Array("cdc-classic")
val topic = "cdc-classic"
val partition = 2
// I am not clear about this line -- I tried to set both the partition number and the offset to 2
val offsets = Map(new TopicPartition(topic, partition) -> 2L)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, offsets))

But when I run this code I get the following exception:

     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
Caused by: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)

P.S.: cdc-classic is the topic name, and it has 17 partitions.


Answer 1:


A Kafka partition is Spark's unit of parallelism. So even if it were technically possible somehow, it wouldn't make much sense, since all the data would be processed by a single executor. Instead of using Spark for this, you can simply run your process as a plain KafkaConsumer:

 String topic = "foo";
 TopicPartition partition0 = new TopicPartition(topic, 0);
 TopicPartition partition1 = new TopicPartition(topic, 1);
 consumer.assign(Arrays.asList(partition0, partition1));

(https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html)

If you want to benefit from Spark's automatic retries, you can simply package that consumer in a Docker image and run it, for instance, on Kubernetes with an appropriate retry configuration.

As for Spark, if you really want to use it, you should check the valid offset range of the partition you are reading. You are probably providing an offset that is out of range, which is why you get the "out of range" message (maybe start with 0?).
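One way to check this (a minimal sketch; the broker address and deserializer settings are placeholders, not from the question) is to ask a plain KafkaConsumer for the partition's valid offset range and clamp the requested starting offset into it:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Clamp a requested starting offset into the partition's valid range.
def clampOffset(requested: Long, earliest: Long, latest: Long): Long =
  math.min(math.max(requested, earliest), latest)

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("cdc-classic", 2)

// beginningOffsets/endOffsets return the first and last valid offsets.
val earliest: Long = consumer.beginningOffsets(List(tp).asJava).get(tp)
val latest: Long = consumer.endOffsets(List(tp).asJava).get(tp)
println(s"Valid offsets for $tp: [$earliest, $latest)")

// If offset 2L is below earliest (e.g. old records were deleted by
// retention), start from the earliest available offset instead.
val start = clampOffset(2L, earliest, latest)
consumer.close()
```

If the requested offset falls below the earliest retained offset, you get exactly the OffsetOutOfRangeException from the question unless `auto.offset.reset` is configured.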




Answer 2:


Specify the partition number and the starting offset for that partition in this line:

Map(new TopicPartition(topic, partition) -> 2L)

where,

  • partition is the partition number

  • 2L refers to the starting offset of that partition.

The data will then be streamed starting from that offset.
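Note that Subscribe still consumes every partition of the topic; the offsets map only sets starting positions. If the goal is to read only one partition, ConsumerStrategies.Assign takes an explicit list of TopicPartitions. A minimal sketch, assuming the same `ssc` and `kafkaParams` as in the question:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

val topic = "cdc-classic"
val partition = 2

// The topic has 17 partitions, so valid partition indices are 0..16.
def validPartition(p: Int, numPartitions: Int): Boolean =
  p >= 0 && p < numPartitions
require(validPartition(partition, 17))

val tp = new TopicPartition(topic, partition)
// Start from offset 0L; in practice pick an offset inside the
// partition's valid range to avoid the OffsetOutOfRangeException above.
val offsets = Map(tp -> 0L)

// Assign consumes ONLY the listed partitions, unlike Subscribe,
// which consumes every partition of the subscribed topics.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](List(tp), kafkaParams, offsets))
```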



Source: https://stackoverflow.com/questions/50734166/stream-data-using-spark-from-a-partiticular-partition-within-kafka-topics
