Stream data using Spark from a particular partition within Kafka topics

Submitted on 2019-12-11 15:45:52

Question


I have already seen a similar question (click here).

But I still want to know: is streaming data from a particular partition really not possible? I have used the Kafka consumer strategies in Spark Streaming's subscribe method.

ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets)

This is the code snippet I tried for subscribing to the topic and partition:

val topics = Array("cdc-classic")
val topic = "cdc-classic"
val partition = 2
// I am not clear about this line -- I tried to set both the partition number and the offset to 2
val offsets = Map(new TopicPartition(topic, partition) -> 2L)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, offsets))

But when I run this code I get the following exception:

     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
Caused by: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)

P.S.: cdc-classic is the topic name, and it has 17 partitions.


Answer 1:


A Kafka partition is Spark's unit of parallelism. So even if it were technically possible somehow, it wouldn't make much sense, since all the data would be processed by a single executor. Instead of using Spark for this, you can simply run your process as a plain KafkaConsumer:

 String topic = "foo";
 TopicPartition partition0 = new TopicPartition(topic, 0);
 TopicPartition partition1 = new TopicPartition(topic, 1);
 consumer.assign(Arrays.asList(partition0, partition1));

(https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html)

If you want to benefit from Spark's automatic retries, you can simply package that consumer in a Docker image and run it, for instance, on Kubernetes with an appropriate retry configuration.

As for Spark, if you really want to use it, you should check the valid offset range of the partition you are reading. You are probably providing an offset that is out of range, which is why you get the "out of range" message (maybe start with 0?).
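One way to check this (a minimal sketch; the broker address and deserializer settings are placeholders, not from the question) is to ask a plain KafkaConsumer for the partition's valid offset range and clamp the requested starting offset into it:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Clamp a requested starting offset into the partition's valid range.
def clampOffset(requested: Long, earliest: Long, latest: Long): Long =
  math.min(math.max(requested, earliest), latest)

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("cdc-classic", 2)

// beginningOffsets/endOffsets return the first and last valid offsets.
val earliest: Long = consumer.beginningOffsets(List(tp).asJava).get(tp)
val latest: Long = consumer.endOffsets(List(tp).asJava).get(tp)
println(s"Valid offsets for $tp: [$earliest, $latest)")

// If offset 2L is below earliest (e.g. old records were deleted by
// retention), start from the earliest available offset instead.
val start = clampOffset(2L, earliest, latest)
consumer.close()
```

If the requested offset falls below the earliest retained offset, you get exactly the OffsetOutOfRangeException from the question unless `auto.offset.reset` is configured.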




Answer 2:


Specify the partition number and the starting offset for that partition in this line:

Map(new TopicPartition(topic, partition) -> 2L)

where,

  • partition is the partition number

  • 2L refers to the starting offset of that partition.

The data will then be streamed starting from that offset.
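Note that Subscribe still consumes every partition of the topic; the offsets map only sets starting positions. If the goal is to read only one partition, ConsumerStrategies.Assign takes an explicit list of TopicPartitions. A minimal sketch, assuming the same `ssc` and `kafkaParams` as in the question:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

val topic = "cdc-classic"
val partition = 2

// The topic has 17 partitions, so valid partition indices are 0..16.
def validPartition(p: Int, numPartitions: Int): Boolean =
  p >= 0 && p < numPartitions
require(validPartition(partition, 17))

val tp = new TopicPartition(topic, partition)
// Start from offset 0L; in practice pick an offset inside the
// partition's valid range to avoid the OffsetOutOfRangeException above.
val offsets = Map(tp -> 0L)

// Assign consumes ONLY the listed partitions, unlike Subscribe,
// which consumes every partition of the subscribed topics.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](List(tp), kafkaParams, offsets))
```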



Source: https://stackoverflow.com/questions/50734166/stream-data-using-spark-from-a-partiticular-partition-within-kafka-topics
