Hive query shows few reducers killed but query is still running. Will the output be proper?

Submitted on 2020-01-16 08:36:33

Question


I have a complex query with multiple left outer joins that has been running for the last hour in Amazon AWS EMR, but a few reducers are shown as Failed and Killed.

My question is: why do some reducers get killed? Will the final output be correct?


Answer 1:


Usually each container has 3 attempts before it finally fails (configurable, as @rbyndoor mentioned). If one attempt fails, the task is restarted until the number of attempts reaches the limit; if it still fails, the whole vertex fails and all other running tasks are killed.

Rare failures of some task attempts are not a critical issue, especially when running on an EMR cluster with spot nodes, which can be removed during execution, causing failures and partial restarts of some vertices.

In most cases you can find the reason for the failures in the tracker logs.

And of course this is not a reason to switch to the deprecated MR engine. Try to find the root cause and fix it.

In some marginal cases, even if a job with some failed attempts succeeds, the data produced may be partially corrupted, for example when using a non-deterministic function such as rand() in the distribute by clause. A restarted container may try to copy data produced by a previous step (a mapper) whose spot node has already been removed. In that case some previous-step containers are restarted, but the data they produce may differ because of the non-deterministic nature of the rand function.
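To illustrate the rand() hazard described above, here is a hedged HiveQL sketch (table and column names are illustrative, not from the question):

```sql
-- Risky: rand() is non-deterministic. If a mapper is rerun after a
-- spot node is lost, the same row can be routed to a different
-- reducer on the retry, so the recomputed partition no longer
-- matches the data already consumed downstream.
INSERT OVERWRITE TABLE target
SELECT * FROM source
DISTRIBUTE BY rand();

-- Safer: distribute by a deterministic expression of the row itself,
-- e.g. a hash of a stable key column. A rerun mapper then reproduces
-- exactly the same partitioning.
INSERT OVERWRITE TABLE target
SELECT * FROM source
DISTRIBUTE BY hash(id);
</imports>
```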

About killed tasks:

Mappers or reducers can be killed for many reasons. First of all, when one container fails completely, all other running tasks are killed. If speculative execution is switched on, duplicated tasks are killed; tasks that do not respond for a long time are killed too, and so on. This is quite normal and usually not an indicator that something is wrong. If the whole job fails or you see many attempt failures, inspect the logs of the failed tasks, not the killed ones, to find the reason.
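If killed duplicate attempts from speculative execution clutter the job view, it can be disabled per session. A hedged sketch using the standard Hadoop/Tez property names (verify them against your distribution's documentation before relying on this):

```sql
-- Disable speculative execution for MR-based stages so duplicate
-- attempts (which show up as "killed") are not launched.
SET mapreduce.map.speculative=false;
SET mapreduce.reduce.speculative=false;

-- Tez controls speculation with its own property.
SET tez.am.speculation.enabled=false;
```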




Answer 2:


There can be many reasons for reducers to be killed. Some of them are:

  • Low staging-area memory.
  • Resource unavailability or deadlock.
  • A limit on the number of reducers that can be spawned by a task, etc.

Generally, if a reducer gets killed it is restarted on its own and the job completes, so there is no data loss. But if the reducers are getting killed again and again and your job is stuck because of that, then you may have to look at the YARN logs to get to a resolution.

Also, it seems you are running Hive in Tez mode; try running in MR mode, it might help.
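Switching the execution engine suggested above can be done per session with a standard Hive setting (note the other answer advises against falling back to the deprecated MR engine except for diagnosis):

```sql
-- Run subsequent queries in this session on MapReduce instead of Tez.
SET hive.execution.engine=mr;

-- Switch back to Tez afterwards.
SET hive.execution.engine=tez;
```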




Answer 3:


Short answer: yes, if your job completes successfully then you will see the right result.

There can be many reasons for a runtime task failure, mainly resource-related: CPU, disk, or memory.

The Tez AppMaster is responsible for dealing with transient container execution failures and must respond to RM requests regarding allocated and possibly deallocated containers.

The Tez AppMaster tries to reassign the task to other containers, under these constraints:

  • tez.maxtaskfailures.per.node (default = 3): ensures the same node is not reused for reassignment.

  • tez.am.task.max.failed.attempts (default = 4): the maximum number of attempts that can fail for a particular task before the task is marked failed. This does not count killed attempts. Four task failures result in DAG failure.
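The two Tez limits above can be tuned per session if your cluster sees frequent spot-node loss. A hedged sketch (the values shown are the defaults quoted above, raised slightly for illustration; normally these live in tez-site.xml):

```sql
-- Tolerate one extra failed attempt per task before failing the vertex.
SET tez.am.task.max.failed.attempts=5;

-- Allow one more failure per node before blacklisting it for reassignment.
SET tez.maxtaskfailures.per.node=4;
```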



Source: https://stackoverflow.com/questions/58499316/hive-query-shows-few-reducers-killed-but-query-is-still-running-will-the-output
