pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset

Happy的楠姐 2021-01-22 06:19

Currently I'm working with Kafka / Zookeeper and pySpark (1.6.0). I have successfully created a Kafka consumer that uses KafkaUtils.createDirectStream()

2 Answers
  •  佛祖请我去吃肉
    2021-01-22 06:42

    I ran into a similar problem. You are right: directStream uses the Kafka low-level API directly, which does not update the reader offset. There are a couple of examples for Scala/Java around, but none for Python. It is easy to do yourself, though. What you need to do is:

    • read from the offset at the beginning
    • save the offset at the end

    For example, I save the offset for each partition in Redis like this:

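    def save_offset(rdd):
      # offsetRanges() has one entry per Kafka partition read in this batch
      ranges = rdd.offsetRanges()
      for rng in ranges:
        # save this per (rng.topic, rng.partition) somewhere, e.g. in Redis
        offset = rng.untilOffset

    # register the callback after defining it; it runs once per batch
    stream.foreachRDD(save_offset)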
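
A minimal sketch of what the "save offset somewhere" step could look like, assuming a dict-like store. `OffsetRange` below is a local stand-in that mirrors the fields of pySpark's `OffsetRange` so the sketch runs without Spark, and the `topic:partition` key format and the helper names are my own convention, not something Spark or Redis prescribes.

```python
from collections import namedtuple

# Stand-in for pyspark.streaming.kafka.OffsetRange (assumption: same field names).
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")

def offset_key(topic, partition):
    # Hypothetical key convention: one key per topic/partition pair.
    return "%s:%d" % (topic, partition)

def save_ranges(ranges, store):
    # untilOffset is exclusive, so it is the offset the next run should start from.
    for rng in ranges:
        store[offset_key(rng.topic, rng.partition)] = rng.untilOffset
    return store

store = save_ranges(
    [OffsetRange("events", 0, 100, 150), OffsetRange("events", 1, 40, 60)],
    {},
)
```

With Redis, the dict assignment would become a `set` call on the client, one key per partition.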
    

    Then, at startup, you can use:

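    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
    fromoffset = {}
    topic_partition = TopicAndPartition(topic, partition)
    fromoffset[topic_partition] = int(value)  # value is the offset you stored previously
    # ssc and kafka_params are your existing StreamingContext and Kafka parameters
    stream = KafkaUtils.createDirectStream(ssc, [topic], kafka_params, fromOffsets=fromoffset)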
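
When there are several partitions, the stored values have to be read back one by one. A rough sketch of that step, assuming the offsets were stored under hypothetical `topic:partition` keys and may come back as strings or bytes (as Redis returns them); `load_offsets` is an illustrative helper, not part of pySpark.

```python
def load_offsets(store):
    # Turn {"topic:partition": offset} into {(topic, partition): offset}.
    offsets = {}
    for key, value in store.items():
        # rsplit keeps topic names that themselves contain ":" intact
        topic, partition = key.rsplit(":", 1)
        offsets[(topic, int(partition))] = int(value)
    return offsets

offsets = load_offsets({"events:0": "150", "events:1": b"60"})
```

In the Spark job, each `(topic, partition)` pair would then be wrapped in `TopicAndPartition(topic, partition)` and the resulting dict passed as `fromOffsets` to `KafkaUtils.createDirectStream`.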
    

    Some tools use zk to track offsets, so for those it is better to save the offset in Zookeeper. This page: https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html describes how to set the offset. Basically, the zk node is: /consumers/[consumer_name]/offsets/[topic name]/[partition id]. Because we are using directStream, you have to make up a consumer name.
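
A small sketch of writing an offset to that node layout. It assumes the `zk` argument behaves like a started kazoo `KazooClient` (only its `ensure_path` and `set` methods are used); the helper names, the consumer name `my-spark-job`, and the `FakeZk` class are made up here so the example runs without a Zookeeper server.

```python
def zk_offset_path(consumer_name, topic, partition):
    # Node layout: /consumers/[consumer_name]/offsets/[topic name]/[partition id]
    return "/consumers/%s/offsets/%s/%d" % (consumer_name, topic, partition)

def write_offset(zk, consumer_name, topic, partition, offset):
    path = zk_offset_path(consumer_name, topic, partition)
    zk.ensure_path(path)  # create the node and its parents if missing
    zk.set(path, str(offset).encode("ascii"))  # offset stored as a decimal string
    return path

class FakeZk(object):
    # In-memory stand-in, only here to exercise the sketch.
    def __init__(self):
        self.nodes = {}

    def ensure_path(self, path):
        self.nodes.setdefault(path, b"")

    def set(self, path, value):
        self.nodes[path] = value

fake = FakeZk()
written = write_offset(fake, "my-spark-job", "events", 0, 150)
```

In a real job you would pass a connected client instead of `FakeZk` and call `write_offset` from inside the `foreachRDD` callback for every offset range.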
