Should the entire cluster be restarted if a single Task Manager crashes?

Question

We're running a standalone Flink cluster with 2 Job Managers and 3 Task Managers. Whenever a TM crashes, we simply restart that particular TM and proceed with the processing.

But reading the comments on this question makes it look like we need to restart all 5 nodes that form the cluster to deal with the failure of a single TM. Am I reading this right? What would be the consequences if we restart just the crashed TM and let the healthy ones run as is?

Answer 1:

Sorry if my answer elsewhere was unclear, but what you are doing is fine. Perhaps it would be more accurate to say that the job is being "reset", which happens automatically. Since checkpoints are globally consistent, it's important that all of the taskmanagers rewind and restart processing from the state recorded in the checkpoint, but Flink takes care of this for you (once the necessary resources are again made available).
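To make the "reset" idea concrete, here is a toy model in plain Python. It is not Flink's API: `Task`, `ToyJob`, and the `tm-1`..`tm-3` names are made up for illustration, and the per-task counter stands in for operator state. It only sketches why every task rewinds to the last checkpoint when one of them fails.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for the work running on one Task Manager."""
    name: str
    processed: int = 0  # records processed so far (the task's "state")


@dataclass
class ToyJob:
    """Toy job that snapshots and restores the state of all tasks together."""
    tasks: list
    checkpoint: dict = field(default_factory=dict)

    def process(self, n):
        # Every task advances by n records.
        for task in self.tasks:
            task.processed += n

    def take_checkpoint(self):
        # A globally consistent snapshot: the state of all tasks at one point.
        self.checkpoint = {task.name: task.processed for task in self.tasks}

    def recover(self):
        # All tasks rewind to the snapshot, not only the one that crashed.
        for task in self.tasks:
            task.processed = self.checkpoint[task.name]


job = ToyJob([Task("tm-1"), Task("tm-2"), Task("tm-3")])
job.process(100)
job.take_checkpoint()
job.process(40)

# Simulate tm-2 crashing: its in-memory state is lost.
job.tasks[1].processed = 0

# Once the crashed worker is back, the whole job resets to the checkpoint.
job.recover()
state = {task.name: task.processed for task in job.tasks}
print(state)
```

In this sketch the healthy tasks discard the 40 records they processed after the checkpoint, so all three resume from the same point. That corresponds to the job being reset while the healthy Task Manager processes themselves keep running.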

Source: https://stackoverflow.com/questions/54251385/should-the-entire-cluster-be-restarted-if-a-single-task-manager-crashes

Tags

apache-flink

flink-streaming
