Spark Stream - 'utf8' codec can't decode bytes


Question


I'm fairly new to stream programming. We have a Kafka stream that uses Avro.

I want to connect the Kafka stream to Spark Streaming. I used the code below.

from pyspark.streaming.kafka import KafkaUtils

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])  # x[1] is the message value, UTF-8-decoded by default

I got the error below.

    return s.decode('utf-8')
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-58: invalid continuation byte

Do I need to specify that Kafka uses Avro, and is the above error caused by that? If so, how can I specify it?


Answer 1:


Right, the problem is with deserialization of the stream. You can use the confluent-kafka-python library and pass its Avro message decoder as the valueDecoder in createDirectStream:

from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

# decode_message is an instance method: build a serializer backed by your Schema Registry URL, e.g. "http://host:8081"
serializer = MessageSerializer(CachedSchemaRegistryClient(url=schema_registry_url))

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers},
                                    valueDecoder=serializer.decode_message)

Credits for the solution to https://stackoverflow.com/a/49179186/6336337
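
For completeness, a minimal sketch of consuming the decoded stream, assuming the serializer and kvs from the snippet above:

values = kvs.map(lambda record: record[1])  # record is (key, value); the value is now a decoded dict
values.pprint()  # print a few decoded records per batch to verify

ssc.start()
ssc.awaitTermination()

The custom valueDecoder replaces the default utf8_decoder, which is what raised the UnicodeDecodeError on the raw Avro bytes.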




Answer 2:


Yes, you should specify it.

With Java:

Creation of the stream:

// AvroType stands for the consumer's generated Avro record class
final JavaInputDStream<ConsumerRecord<String, AvroType>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topics, kafkaParams));

In the Kafka consumer config:

kafkaParams.put("key.deserializer", org.apache.kafka.common.serialization.StringDeserializer.class);
        kafkaParams.put("value.deserializer", SpecificAvroDeserializer.class);


Source: https://stackoverflow.com/questions/52702407/spark-stream-utf8-codec-cant-decode-bytes
