Spark Encoders: when to use beans()

Submitted by 巧了我就是萌 on 2019-12-24 11:16:13

Question


I came across a memory management problem while using Spark's caching mechanism. I am currently utilizing Encoders with Kryo and was wondering if switching to beans would help me reduce the size of my cached dataset.

Basically, what are the pros and cons of using beans over Kryo serialization when working with Encoders? Are there any performance improvements? Is there a way to compress a cached Dataset other than caching with the SER storage option?

For the record, I have found a similar topic that touches on the comparison between the two, but it doesn't go into any detail.


Answer 1:


Whenever you can. Unlike generic binary Encoders, which use general-purpose binary serialization and store whole objects as opaque blobs, Encoders.bean[T] leverages the structure of the object to provide a class-specific storage layout.

This difference becomes obvious when you compare the schemas created using Encoders.bean and Encoders.kryo.
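For illustration, here is a minimal sketch of that comparison. The Record bean class is made up (any JavaBean-style class with getters and setters behaves the same way), and the printed schemas are approximate:

    import org.apache.spark.sql.Encoders

    // Hypothetical JavaBean-style class: public no-arg constructor plus getters/setters.
    class Record extends Serializable {
      @scala.beans.BeanProperty var id: Long = _
      @scala.beans.BeanProperty var name: String = _
    }

    // The bean encoder exposes every field as a typed column:
    println(Encoders.bean(classOf[Record]).schema.treeString)
    // root
    //  |-- id: long (nullable = false)
    //  |-- name: string (nullable = true)

    // The kryo encoder stores the whole object as a single opaque binary column:
    println(Encoders.kryo(classOf[Record]).schema.treeString)
    // root
    //  |-- value: binary (nullable = true)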

Why does it matter?

  • You get efficient field access through the SQL API without any need for deserialization, plus full support for all Dataset transformations.
  • With transparent field serialization you can fully utilize columnar storage, including built-in compression (a short sketch follows this list).
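A rough sketch of those two points, assuming a hypothetical Person bean and a local SparkSession (names and setup are illustrative, not prescriptive):

    import org.apache.spark.sql.{Encoders, SparkSession}
    import org.apache.spark.storage.StorageLevel

    // Hypothetical JavaBean used only for this illustration.
    class Person extends Serializable {
      @scala.beans.BeanProperty var name: String = _
      @scala.beans.BeanProperty var age: Int = _
    }

    val spark = SparkSession.builder().master("local[*]").appName("bean-vs-kryo").getOrCreate()

    val p = new Person
    p.setName("alice")
    p.setAge(42)

    // Bean encoder: fields are real columns, so projections and filters run
    // without deserializing whole objects, and the cache is columnar (and compressible).
    val beanDs = spark.createDataset(Seq(p))(Encoders.bean(classOf[Person]))
    beanDs.persist(StorageLevel.MEMORY_ONLY)
    beanDs.select("age").where("age > 18").show()

    // Kryo encoder: the same data becomes one opaque binary column, so none of
    // the above is possible without first deserializing every object.
    val kryoDs = spark.createDataset(Seq(p))(Encoders.kryo(classOf[Person]))
    kryoDs.printSchema()  // value: binary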

So when should you use the kryo Encoder? In general, when nothing else works. Personally, I would avoid it completely for data serialization. The only really useful application I can think of is serialization of an aggregation buffer (see, for example, How to find mean of grouped Vector columns in Spark SQL?).
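For completeness, a hedged sketch of that aggregation-buffer case. SumBuffer and MeanAgg are made-up names; the point is only that Kryo is confined to the intermediate buffer (a type with no built-in Encoder), while the input and output stay strongly typed:

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Plain mutable buffer class; no bean getters/setters, no Product structure.
    class SumBuffer(var count: Long, var total: Double) extends Serializable {
      def this() = this(0L, 0.0)  // no-arg constructor keeps Kryo instantiation simple
    }

    object MeanAgg extends Aggregator[Double, SumBuffer, Double] {
      def zero: SumBuffer = new SumBuffer()
      def reduce(b: SumBuffer, x: Double): SumBuffer = { b.count += 1; b.total += x; b }
      def merge(a: SumBuffer, b: SumBuffer): SumBuffer = {
        a.count += b.count; a.total += b.total; a
      }
      def finish(b: SumBuffer): Double = if (b.count == 0) Double.NaN else b.total / b.count

      // Kryo only for the opaque intermediate buffer; the result keeps a typed encoder.
      def bufferEncoder: Encoder[SumBuffer] = Encoders.kryo[SumBuffer]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Usage, assuming ds: Dataset[Double]:
    //   ds.select(MeanAgg.toColumn).show()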



Source: https://stackoverflow.com/questions/51370288/spark-encoders-when-to-use-beans
