How does Spark recover the data from a failed node?


When you call rdd.persist, the RDD does not materialize its contents right away. It does so only when you perform an action on the RDD, following the same lazy-evaluation principle as any other transformation.
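A minimal sketch of that laziness (Scala; assumes a running SparkContext `sc` and an illustrative input path):

import org.apache.spark.storage.StorageLevel

val lengths = sc.textFile("data.txt").map(_.length)  // nothing is read or computed yet
lengths.persist(StorageLevel.MEMORY_ONLY)            // only marks the RDD for caching, still lazy
lengths.count()                                       // first action: file is read, partitions computed and cached
lengths.reduce(_ + _)                                 // second action: served from the cached partitions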

Now, an RDD knows the partitions it should operate on and the DAG (lineage) associated with it. With that DAG it is perfectly capable of recreating any materialized partition.

So, when a node fails, the driver spawns another executor on some other node and hands it the data partition it was supposed to work on plus the associated DAG in a closure. With this information the new executor can recompute the data and materialize it.
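You can inspect that lineage yourself with toDebugString, which prints the chain of transformations Spark would replay to rebuild a lost partition (a sketch with an illustrative word-count RDD; the exact output differs between versions):

val counts = sc.textFile("data.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)  // prints the DAG, roughly: ShuffledRDD <- MapPartitionsRDD <- ... <- textFile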

In the meantime, the cached RDD will not have all of its data in memory; the partitions that lived on the lost node have to be recomputed (or fetched back from disk), so those accesses take a little more time.

As for replication: yes, Spark supports in-memory replication. You need to pass StorageLevel.MEMORY_AND_DISK_2 (or MEMORY_ONLY_2) when you persist.

rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

This ensures each cached partition is stored on two cluster nodes.
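You can confirm which level was applied via getStorageLevel (a sketch, assuming `rdd` is the RDD persisted above; the printed form may vary slightly by version):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
println(rdd.getStorageLevel)              // something like StorageLevel(disk, memory, deserialized, 2 replicas)
println(rdd.getStorageLevel.replication)  // 2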

I think the way I finally understood how Spark is resilient was when someone told me not to think of RDDs as big, distributed arrays of data.

Instead, I should picture them as containers holding instructions: what steps to take to transform the data from its source, applied one step at a time until a result is produced.

Now, if you really care about not losing data when persisting, you can specify that you want your cached data replicated.

For this, you need to select a storage level. So instead of the usual:

MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

you can specify that you want your persisted data replicated:

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.

So if a node fails, you will not have to recompute the data.
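As a quick sketch of how that looks in practice (Scala; the path and names here are just illustrative):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("events.log").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK_2)  // each cached partition is stored on two nodes
errors.count()                                  // materializes the cache; if one node later dies, the surviving replica is used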

Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
