Does binary encoding of AVRO compress data?

Submitted by 北战南征 on 2019-12-21 07:25:26

Question


In one of our projects we are using Kafka with AVRO to transfer data across applications. Data is added to an AVRO object and the object is binary-encoded before being written to Kafka. We use binary encoding because it is generally described as a more minimal representation than other formats.

The data is usually a JSON string, and when it is saved in a file it uses up to 10 MB of disk. However, when the file is compressed (.zip), it uses only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing it to a Kafka topic.
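For reference, the kind of ratio described above is easy to reproduce with plain java.util.zip, since JSON payloads repeat the same key names in every record (the payload below is an illustrative stand-in, not the project's actual data):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class CompressDemo {
    public static void main(String[] args) throws IOException {
        // Repetitive JSON-like payload: many records sharing the same key names
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 10_000; i++) {
            sb.append("{\"id\":").append(i).append(",\"name\":\"user\",\"active\":true},");
        }
        sb.append("]");
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Deflate the whole payload in memory
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        }

        System.out.println("raw bytes=" + raw.length + ", deflated bytes=" + out.size());
    }
}
```

On repetitive input like this, deflate typically shrinks the data by an order of magnitude, which matches the MB-to-KB observation above.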

When the length of the binary-encoded message (i.e., the length of the byte array) is measured, it is proportional to the length of the data string. So I assume binary encoding is not reducing the size.

Could someone tell me if binary encoding compresses data? If not, how can I apply compression?

Thanks!


Answer 1:


Does binary encoding compress data?

Yes and no, it depends on your data.

According to the Avro binary encoding, yes, in the sense that it stores the schema only once per .avro file, regardless of how many records are in that file, so it saves space by not storing JSON's key names many times. Avro serialization also does a bit of compression by storing int and long values with variable-length zig-zag coding (effective only for small values). Beyond that, Avro does not "compress" data.
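The zig-zag/varint scheme mentioned above can be sketched as follows (a minimal illustration of how Avro encodes ints, not the library's actual code):

```java
import java.io.ByteArrayOutputStream;

public class ZigZagDemo {
    // Zig-zag map: values of small magnitude (positive or negative) become small unsigned values
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Variable-length encoding: 7 bits per byte, high bit set means "more bytes follow"
    static byte[] encodeInt(int n) {
        int v = zigZag(n);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeInt(1).length);         // prints 1 (small value: 1 byte)
        System.out.println(encodeInt(-1).length);        // prints 1 (small negative: also 1 byte)
        System.out.println(encodeInt(1_000_000).length); // prints 3 (larger value: more bytes)
    }
}
```

This is why the encoded size stays roughly proportional to the data: strings are written as a varint length followed by their raw UTF-8 bytes, so only small integers actually shrink.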

No, because in some extreme cases Avro-serialized data can be bigger than the raw data, e.g. a .avro file with a single Record containing only one string field. There, the schema overhead can outweigh the saving from not storing the key name.

If not, how can I apply compression?

According to Avro codecs, Avro has a built-in compression codec and optional ones. Just add one line while writing object container files (setCodec is an instance method, called on your DataFileWriter before create):

dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // using deflate, compression level 6

or

dataFileWriter.setCodec(CodecFactory.snappyCodec()); // using the snappy codec

To use snappy you need to add the snappy-java library to your dependencies.




Answer 2:


If you plan to store your data in Kafka, consider using the Kafka producer's compression support, set in the producer configuration:

props.put("compression.type", "snappy"); // "compression.codec" in the old Scala producer

Compression is completely transparent to the consumer side; all consumed messages are automatically decompressed.
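A minimal, self-contained configuration sketch for the Java client (the broker address and serializer choices below are illustrative):

```java
import java.util.Properties;

public class ProducerConfigDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Enable producer-side compression; valid values include gzip, snappy, lz4, zstd
        props.put("compression.type", "snappy");

        // These properties would then be passed to new KafkaProducer<>(props)
        System.out.println(props.getProperty("compression.type"));
    }
}
```

Note that with producer-side compression, whole record batches are compressed together, so the repeated Avro/JSON structure across messages compresses well.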



Source: https://stackoverflow.com/questions/26711256/does-binary-encoding-of-avro-compress-data
