Question
We're running a standalone Flink cluster with 2 Job Managers and 3 Task Managers. Whenever a TM crashes, we simply restart that particular TM and proceed with the processing.
But reading the comments on this question makes it look like we need to restart all 5 nodes that form the cluster to deal with the failure of a single TM. Am I reading this right? What would be the consequences if we restart just the crashed TM and let the healthy ones run as is?
Answer 1:
Sorry if my answer elsewhere was unclear, but what you are doing is fine. Perhaps it would be more accurate to say that the job is being "reset", rather than that the cluster is being restarted, and this reset happens automatically. Since checkpoints are globally consistent, it's important that all of the taskmanagers rewind and restart processing from the state recorded in the checkpoint, but Flink takes care of this for you (once the necessary resources are again made available).
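To illustrate why every taskmanager has to rewind, here is a minimal toy sketch in plain Python. It is not Flink code: the worker names, the round-robin routing, and the `simulate` function are all assumptions made up for the illustration. Each worker keeps a count of the records it has seen, and the checkpoint stores both those counts and the source offset.

```python
def simulate(rewind_all: bool):
    """Toy model of a job on three workers (hypothetical, not the Flink API)."""
    workers = ["tm1", "tm2", "tm3"]
    counts = {w: 0 for w in workers}   # per-worker state
    offset = 0                         # position in the input stream

    def process(n):
        nonlocal offset
        for _ in range(n):
            # Route each record round-robin and advance the source offset.
            counts[workers[offset % len(workers)]] += 1
            offset += 1

    process(6)
    # A globally consistent checkpoint: all worker state plus the offset.
    checkpoint = {"counts": dict(counts), "offset": offset}

    process(3)          # work done after the checkpoint
    crashed = "tm2"     # this worker fails and loses its in-memory state

    # Recovery: the source always replays from the checkpointed offset.
    offset = checkpoint["offset"]
    for w in workers:
        if rewind_all or w == crashed:
            counts[w] = checkpoint["counts"][w]

    process(3)          # replay the records since the checkpoint
    return sum(counts.values()), offset


print(simulate(rewind_all=True))   # (9, 9): total count matches the offset
print(simulate(rewind_all=False))  # (11, 9): healthy workers counted replayed records twice
```

In this toy model, restoring only the crashed worker's state leaves the healthy workers ahead of the checkpoint, so replayed records are counted twice. Rewinding the state of all workers keeps the result consistent, and that state rewind is separate from restarting processes: only the crashed TM process needs to be started again.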
Source: https://stackoverflow.com/questions/54251385/should-the-entire-cluster-be-restarted-if-a-single-task-manager-crashes