Spark _temporary creation reason

灰色年华 2021-01-06 09:19

Why does Spark, while saving results to a file system, upload result files to a _temporary directory and then move them to the output folder instead of directly uploading them t

1 answer

有刺的猬 (OP)

2021-01-06 09:41
A two-stage process is the simplest way to ensure consistency of the final result when working with file systems.

Remember that each executor thread writes its result set independently of the other threads, and that writes can happen at different moments in time or even reuse the same set of resources. At the moment of writing, Spark cannot determine whether all writes will succeed.
- In case of failure, one can roll back the changes by removing the temporary directory.
- In case of success, one can commit the changes by moving the contents of the temporary directory to the output location.
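
The rollback and commit steps can be sketched in plain Python on a local file system. This is only an illustration of the pattern, not Spark's actual committer code: the function name `write_job`, the `fail_on` parameter, and the `part-NNNNN` file naming are assumptions made for the example.

```python
import os
import shutil
from pathlib import Path
from typing import List, Optional


def write_job(output_dir: Path, partitions: List[List[str]],
              fail_on: Optional[int] = None) -> None:
    """Hypothetical two-stage write: stage under _temporary, then commit or roll back."""
    temp_dir = output_dir / "_temporary"
    temp_dir.mkdir(parents=True, exist_ok=True)
    try:
        # Stage 1: every "task" writes its partition into the temporary directory.
        for i, rows in enumerate(partitions):
            if i == fail_on:
                # Simulated task failure (assumption for the example).
                raise RuntimeError(f"task {i} failed")
            (temp_dir / f"part-{i:05d}").write_text("\n".join(rows))
    except Exception:
        # Rollback: removing the temporary directory discards all partial output.
        shutil.rmtree(temp_dir)
        raise
    # Stage 2 (commit): move each staged file into the final output directory.
    for part in sorted(temp_dir.iterdir()):
        os.replace(part, output_dir / part.name)
    temp_dir.rmdir()
```

On a local file system each `os.replace` is a cheap rename, which is why the commit step costs little there.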
Another benefit of this model is a clear distinction between writes in progress and finalized output. As a result, it can easily be integrated with simple workflow management tools, without the need for a separate state store or other synchronization mechanism.
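
As a rough sketch of how a simple workflow tool might rely on that distinction, the check below inspects only the directory layout. The function name `output_state`, the returned labels, and the `part-*` naming are hypothetical choices for this example, not part of any Spark API.

```python
from pathlib import Path


def output_state(output_dir: Path) -> str:
    """Classify an output directory by its layout alone (illustrative only)."""
    if not output_dir.exists():
        return "missing"
    if (output_dir / "_temporary").exists():
        # A _temporary directory means a write is still running or was abandoned.
        return "in_progress"
    if any(output_dir.glob("part-*")):
        # Output files are present and nothing is staged: treat as committed.
        return "finalized"
    return "empty"
```

A downstream step could poll `output_state` and start only once it returns `"finalized"`, without consulting any external state store.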

This model is simple, reliable, and works well with the file systems for which it was designed. Unfortunately, it doesn't perform as well with object stores, which don't support native moves, so a rename typically becomes a copy followed by a delete.