Apache flink on Kubernetes - Resume job if jobmanager crashes

三世轮回 提交于 2019-11-30 09:55:27

Out of the box, Flink requires a ZooKeeper cluster to recover from JobManager crashes. However, I think you can have a lightweight implementation of the HighAvailabilityServices, CompletedCheckpointStore, CheckpointIDCounter and SubmittedJobGraphStore which can bring you quite far.

Given that you have only one JobManager running at all times (not entirely sure whether K8s can guarantee this) and that you have a persistent storage location, you could implement a CompletedCheckpointStore which retrieves the completed checkpoints from the persistent storage system (e.g. reading all stored checkpoint files). Additionally, you would have a file which contains the current checkpoint id counter for CheckpointIDCounter and all the submitted job graphs for the SubmittedJobGraphStore. So the basic idea is to store everything on a persistent volume which is accessible by the single JobManager.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!