Question
How are software/hardware failures handled in YARN? Specifically, what happens in case of container(s) failure/crash?
Answer 1:
- Container and task failures are detected by the NodeManager. When a container fails or dies, the NodeManager detects the failure and reports it to the ResourceManager, which notifies the ApplicationMaster. The ApplicationMaster then requests a new container to replace the failed one and restarts the task in it.
- If the ApplicationMaster itself fails, the ResourceManager detects the failure and starts a new instance of the ApplicationMaster in a new container.
Answer 2:
- The ApplicationMaster re-attempts any task that completes with an exception or stops responding (up to 4 attempts by default).
- A job with too many failed tasks is considered a failed job.
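The retry behaviour described in this answer can be sketched as a toy model. This is not Hadoop code: `run_with_retries`, `run_job`, `flaky`, and `always_fails` are hypothetical names made up for illustration. The limit of 4 mirrors the default of `mapreduce.map.maxattempts` and `mapreduce.reduce.maxattempts`, and the `max_failed_percent` threshold is assumed to behave like `mapreduce.map.failures.maxpercent` (default 0, meaning a single permanently failed task fails the job).

```python
from typing import Callable, List

# Mirrors the default of mapreduce.map.maxattempts / mapreduce.reduce.maxattempts.
MAX_TASK_ATTEMPTS = 4


def run_with_retries(task: Callable[[int], None],
                     max_attempts: int = MAX_TASK_ATTEMPTS) -> bool:
    """Run `task` until it succeeds or `max_attempts` is exhausted.

    `task` receives the 1-based attempt number and raises on failure.
    Returns True if some attempt succeeded, False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task(attempt)
            return True
        except Exception:
            # On a real cluster, the failed attempt's container is gone and
            # a new container is requested for the next attempt.
            continue
    return False


def run_job(tasks: List[Callable[[int], None]],
            max_failed_percent: float = 0.0) -> bool:
    """Return True if the share of permanently failed tasks is tolerated."""
    if not tasks:
        return True
    failed = sum(1 for t in tasks if not run_with_retries(t))
    return 100.0 * failed / len(tasks) <= max_failed_percent


def flaky(attempt: int) -> None:
    """Hypothetical task that crashes twice, then succeeds on attempt 3."""
    if attempt < 3:
        raise RuntimeError("container crashed")


def always_fails(attempt: int) -> None:
    """Hypothetical task that fails on every attempt."""
    raise RuntimeError("bad input record")
```

With these definitions, `run_job([flaky])` succeeds because the task recovers within 4 attempts, `run_job([flaky, always_fails])` fails because one task exhausts its attempts, and `run_job([flaky, always_fails], max_failed_percent=50.0)` succeeds because half of the tasks are allowed to fail.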
Source: https://stackoverflow.com/questions/30694747/how-container-failure-is-handled-for-a-yarn-mapreduce-job