When to use Kryo serialization in Spark?

梦谈多话 2021-02-19 13:17

I am already compressing RDDs using conf.set("spark.rdd.compress","true") and persist(MEMORY_AND_DISK_SER). Will using Kryo serialization make things any better, or is it even needed?

3 Answers
  •  粉色の甜心
    2021-02-19 13:39

    Both of the RDD states you described (compressed and persisted) use serialization. When you persist an RDD, you are serializing it and saving it to disk (in your case, compressing the serialized output as well). You are right that serialization is also used for shuffles (sending data between nodes): any time data needs to leave a JVM, whether it's going to local disk or through the network, it needs to be serialized.
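    As a minimal sketch of the two settings from the question (the app name and data are illustrative):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("persist-example")         // illustrative name
      .set("spark.rdd.compress", "true")     // compress serialized partitions
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000)

    // MEMORY_AND_DISK_SER stores each partition as a serialized (and, with
    // the setting above, compressed) byte array in memory, spilling to disk
    // when memory runs out.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    rdd.count()  // the first action materializes the cached, serialized form
    ```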

    Kryo is a significantly optimized serializer, and performs better than the standard Java serializer for just about everything. In your case, you may actually be using Kryo already. You can check your Spark configuration parameter:

    "spark.serializer" should be "org.apache.spark.serializer.KryoSerializer".

    If it's not, then you can set this internally with:

    conf.set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )
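
    To get the most out of Kryo, you can also register the classes you serialize up front; Kryo works without registration, but registering avoids writing full class names into the serialized output. A sketch, where `Record` is a hypothetical domain class:

    ```scala
    import org.apache.spark.SparkConf

    // Hypothetical class that will appear in shuffled or persisted data.
    case class Record(id: Long, name: String)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Record]))
      // Optional: fail fast if an unregistered class gets serialized.
      .set("spark.kryo.registrationRequired", "true")
    ```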
    

    Regarding your last question ("is it even needed?"), it's hard to make a general claim about that. Kryo optimizes one of the slow steps in communicating data, but it's entirely possible that in your use case, others are holding you back. But there's no downside to trying Kryo and benchmarking the difference!
