I\'ve been strugling with distcp for several days and I swear I have googled enough. Here is my use-case:
I have a main folder in a certain locati
YARN containers are built on top of Linux "cgroups". These "cgroups" are used to put soft limits on CPU, but not on RAM...
Therefore YARN uses a clumsy workaround: it periodically checks how much RAM each container uses, and kills brutally anything that got over quota. So you lose the execution logs, and only get that dreadful message you have seen.
In most cases, you are running some kind of JVM binary (i.e. a Java/Scala utility or custom program) so you can get away by setting your own JVM quotas (especially -Xmx
) so that you always stay under the YARN limit. Which means some wasted RAM because of the safety margin. But then the worse case is an clean failure of the JVM when it's out of memory, you get the execution logs in extenso and can start adjusting the quotas -- or fixing your memory leaks :-/
So what happens in your specific case? You are using Oozie to start a shell -- then the shell starts a hadoop
command, which runs in a JVM. It is on that embedded JVM that you must set the Max Heap Size.
oozie.launcher.mapreduce.map.memory.mb
) then you must ensure that the Java commands inside the shell do not consume more than, say, 28GB of Heap (to stay on the safe side).
If you are lucky, setting a single env variable will do the trick:
export HADOOP_OPTS=-Xmx28G
hadoop distcp ...........
If you are not lucky, you will have to unwrap the whole mess of hadoop-env.sh
mixing different env variables with different settings (set by people that visibly hate you, in init scripts that you cannot even know about) to be interpreted by the JVM using complex precedence rules. Have fun. You may peek at that very old post for hints about where to dig.