If one partition is lost, we can use lineage to reconstruct it. Will the base RDD be loaded again?

不知归路 2021-01-13 01:48

I read the paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". The author said that if the one partition is lost, we can u

1 Answer
  •  -上瘾入骨i
    2021-01-13 02:19

    Yes. As you mentioned, if the RDD that was used to create the lost partition is no longer in memory, it has to be loaded again from disk and recomputed. If that parent RDD is not available either (neither in memory nor on disk), Spark has to go one step further back and recompute the RDD before it. In the worst case, Spark has to go all the way back to the original data.
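To make the walk-back concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: `ToyRDD`, its `cache` dictionary, and the `computed` log are hypothetical names invented for illustration, and the model assumes narrow dependencies only (each partition depends on exactly one parent partition) and that every level caches what it computes.

```python
# Toy model of lineage-based recovery; NOT Spark's actual implementation.
class ToyRDD:
    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent    # previous dataset in the lineage chain
        self.fn = fn            # per-partition transformation
        self.source = source    # only set on the base dataset
        self.cache = {}         # partition index -> materialized data
        self.computed = []      # log of partition indices computed here

    def compute(self, idx):
        """Return partition idx, recomputing from lineage if not cached."""
        if idx in self.cache:
            return self.cache[idx]
        self.computed.append(idx)
        if self.parent is None:
            data = list(self.source[idx])  # go back to the original data
        else:
            data = self.fn(self.parent.compute(idx))
        self.cache[idx] = data
        return data


source = [[1, 2], [3, 4], [5, 6]]  # three partitions of input data
base = ToyRDD("base", source=source)
doubled = ToyRDD("doubled", parent=base, fn=lambda p: [x * 2 for x in p])
summed = ToyRDD("summed", parent=doubled, fn=lambda p: [sum(p)])

result_before = [summed.compute(i) for i in range(3)]

# Simulate a failure: partition 1 disappears at every level of the chain.
for rdd in (base, doubled, summed):
    rdd.cache.pop(1, None)
    rdd.computed.clear()

recovered = summed.compute(1)
print(recovered)       # the rebuilt partition
print(base.computed)   # which base partitions had to be re-read
```

In this sketch only partition 1 of the base data is read again; partitions 0 and 2 are untouched. If `doubled` still held partition 1 in its cache, the recursion would stop there and the base data would not be read at all.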

    Long lineage chains like the worst case described above can mean long recomputation times. That is when you should consider checkpointing, which stores intermediate results in reliable storage (such as HDFS). Spark can then use the checkpointed data instead of going all the way back to the original data source.
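The effect of a checkpoint on recomputation cost can be sketched with another toy model in plain Python. Again, this is not Spark code: `recompute`, the `lineage` list of functions, and the `checkpoints` dictionary are hypothetical names, and a single partition is modelled as a plain list.

```python
# Toy model of checkpointing; NOT Spark's actual implementation.
def recompute(stage, lineage, checkpoints, source):
    """Rebuild the output of lineage[stage] for one partition.

    Walks back to the nearest checkpointed stage (or to the source data)
    and replays only the steps after it.
    Returns (data, number_of_steps_replayed).
    """
    start = stage
    while start >= 0 and start not in checkpoints:
        start -= 1
    data = checkpoints[start] if start >= 0 else source
    for step in lineage[start + 1 : stage + 1]:
        data = step(data)
    return data, stage - start


lineage = [
    lambda p: [x + 1 for x in p],
    lambda p: [x * 2 for x in p],
    lambda p: [x - 3 for x in p],
    lambda p: [x * x for x in p],
    lambda p: [x + 10 for x in p],
]
source = [1, 2, 3]

# Without a checkpoint, every step is replayed from the source.
full, full_steps = recompute(4, lineage, {}, source)

# Save the output of stage 2 to "reliable storage" (here: a dict).
checkpoints = {2: recompute(2, lineage, {}, source)[0]}
short, short_steps = recompute(4, lineage, checkpoints, source)

print(full_steps, short_steps)
```

Both calls produce the same data, but the checkpointed run replays two steps instead of five. In Spark itself, the corresponding calls are `setCheckpointDir` on the `SparkContext` and `checkpoint()` on the RDD.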

    @Comment: I'm having trouble finding official reference material, but from what I remember of the codebase, Spark only recomputes the part of the data that was lost.
