pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset


Happy的楠姐 2021-01-22 06:19

Currently I'm working with Kafka / Zookeeper and pySpark (1.6.0). I have successfully created a Kafka consumer, which uses KafkaUtils.createDirectStream()

2 answers

佛祖请我去吃肉 (original poster)

2021-01-22 06:42
I ran into a similar problem. You are right: using directStream means using the Kafka low-level API directly, which does not update the reader offset. There are a couple of examples for Scala/Java around, but none for Python. It is easy to do yourself, though. What you need to do is:
- read the offset at the beginning
- save the offset at the end
For example, I save the offset for each partition in Redis by doing:
```
def save_offset(rdd):
    # offsetRanges() holds one entry per Kafka topic partition in this batch
    ranges = rdd.offsetRanges()
    for rng in ranges:
        rng.untilOffset  # save this offset somewhere, keyed by rng.topic and rng.partition

stream.foreachRDD(save_offset)
```
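To make the "save the offset at the end" step concrete, here is a minimal sketch. It assumes a dict-like store standing in for Redis, and the key layout `kafka:offset:<topic>:<partition>`, the helper names, and the `FakeRange` tuple are all hypothetical. `FakeRange` only mimics the `topic`, `partition`, `fromOffset` and `untilOffset` attributes of the objects returned by `rdd.offsetRanges()`, so the example runs without Spark.

```python
from collections import namedtuple

# Stand-in for pyspark's OffsetRange (same four attribute names), for illustration only.
FakeRange = namedtuple("FakeRange", "topic partition fromOffset untilOffset")


def offset_key(topic, partition):
    """Build the (hypothetical) storage key for one topic partition."""
    return "kafka:offset:%s:%d" % (topic, partition)


def save_offsets(store, ranges):
    """Record the end offset of every range; `store` is any dict-like object."""
    for rng in ranges:
        store[offset_key(rng.topic, rng.partition)] = str(rng.untilOffset)


store = {}  # assumption: a plain dict in place of a Redis client
save_offsets(store, [FakeRange("events", 0, 10, 25), FakeRange("events", 1, 7, 9)])
print(store)
```

With a real Redis client the dict assignment would presumably become something like `client.set(key, value)`; inside `foreachRDD` you would pass `rdd.offsetRanges()` as `ranges`.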
Then, at the beginning, you can use:
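```
from pyspark.streaming.kafka import TopicAndPartition

fromoffset = {}
topic_partition = TopicAndPartition(topic, partition)
fromoffset[topic_partition] = int(value)  # value is the offset you stored previously
# then pass it on, e.g. KafkaUtils.createDirectStream(ssc, [topic], kafkaParams, fromOffsets=fromoffset)
```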
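The "read the offset at the beginning" step can be sketched the same way for several partitions at once. This is only an illustration: the store is again a plain dict standing in for Redis, the key layout `kafka:offset:<topic>:<partition>` and the `load_offsets` helper are hypothetical, and `make_key` defaults to a tuple so the code runs without Spark installed.

```python
def load_offsets(store, topic, partitions, make_key=lambda t, p: (t, p)):
    """Return {key: offset} for every partition that has a stored offset.

    Partitions with no stored value are skipped, so a caller can decide
    separately how to start them.
    """
    from_offsets = {}
    for partition in partitions:
        raw = store.get("kafka:offset:%s:%d" % (topic, partition))
        if raw is not None:
            from_offsets[make_key(topic, partition)] = int(raw)
    return from_offsets


# assumption: offsets were stored earlier as strings under these keys
store = {"kafka:offset:events:0": "25", "kafka:offset:events:1": "9"}
from_offsets = load_offsets(store, "events", [0, 1, 2])
print(from_offsets)
```

In a Spark job you would pass `make_key=TopicAndPartition` so that the resulting dict has the shape expected by the `fromOffsets` argument of `KafkaUtils.createDirectStream()`.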
Some tools use zk to track offsets, and for those it is better to save the offset in Zookeeper. This page describes how to set the offset: https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html. Basically, the zk node is /consumers/[consumer_name]/offsets/[topic name]/[partition id]. Since we are using directStream, you have to make up a consumer name.
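As a rough sketch of the Zookeeper variant: the helper below builds the node path described above and writes the offset through any client object that offers `ensure_path(path)` and `set(path, value)`, which is the shape of kazoo's `KazooClient`. The helper names, the consumer name `my-spark-job`, and the `FakeZk` class are made up for illustration; `FakeZk` just records writes so the example runs without a Zookeeper server.

```python
def zk_offset_path(consumer, topic, partition):
    """Path of the znode that holds the offset of one topic partition."""
    return "/consumers/%s/offsets/%s/%d" % (consumer, topic, partition)


def write_offset(zk, consumer, topic, partition, offset):
    """Create the znode if needed and store the offset as a byte string."""
    path = zk_offset_path(consumer, topic, partition)
    zk.ensure_path(path)
    zk.set(path, str(offset).encode("utf-8"))
    return path


class FakeZk(object):
    """In-memory stand-in for a Zookeeper client (illustration only)."""

    def __init__(self):
        self.nodes = {}

    def ensure_path(self, path):
        self.nodes.setdefault(path, b"")

    def set(self, path, value):
        self.nodes[path] = value


zk = FakeZk()
path = write_offset(zk, "my-spark-job", "events", 0, 25)
print(path, zk.nodes[path])
```

Against a real cluster you would create and `start()` a `KazooClient` and pass it in place of `FakeZk`, calling `write_offset` once per offset range from within `foreachRDD`.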