Reuse JVM in Hadoop MapReduce jobs

独厮守ぢ 2020-12-13 21:41

I know we can set the property "mapred.job.reuse.jvm.num.tasks" to reuse JVMs. My questions are:

(1) how to decide the number of tasks to be set here, -1 or some o

2 answers
  • 2020-12-13 22:00

    JVM reuse (only possible in MR1) should help performance because it removes the JVM startup lag, but the gain is marginal and comes with a number of drawbacks (read: side effects). Most tasks run for a long time (tens of seconds or even minutes), so startup time is not the problem when you look at those task run times. You would also like each task to start on a clean slate. When you reuse a JVM, there is a chance that the heap is not completely clean (it is fragmented from the previous runs). The fragmentation can lead to more GCs and nullify all the startup-time gains. If there is a memory leak, it carries over and can also affect the memory usage of later tasks. So it is better to start a new JVM for each task, unless the tasks are reasonably small. In MR2 (YARN), a new JVM is always started for each task. The exception is uber tasks, which run in the local JVM (that of the ApplicationMaster) only.
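To make the "startup time is marginal" argument concrete, here is a small worked calculation in Java. It is only a sketch: the class `StartupOverhead`, the method `startupShare`, and the 1-second JVM startup figure are assumptions made for illustration, not measurements from a real cluster.

```java
final class StartupOverhead {
    /**
     * Fraction of a task's wall-clock time spent starting the JVM,
     * given a startup cost and the task's own run time (both in seconds).
     */
    static double startupShare(double startupSeconds, double taskSeconds) {
        if (startupSeconds < 0 || taskSeconds <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return startupSeconds / (startupSeconds + taskSeconds);
    }

    public static void main(String[] args) {
        double assumedStartup = 1.0; // hypothetical JVM startup cost in seconds

        // A one-minute task: startup is a small slice of the total.
        System.out.println(startupShare(assumedStartup, 60.0));

        // A two-second task: startup is a third of the total.
        System.out.println(startupShare(assumedStartup, 2.0));
    }
}
```

Under that assumed 1-second startup, a 60-second task spends about 1.6% of its time on JVM startup, while a 2-second task spends about 33%. Only in the second case is there much for JVM reuse to win back.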

  • 2020-12-13 22:16

    If you have very small tasks that are definitely running one after another, it is useful to set this property to -1 (meaning that a spawned JVM is reused an unlimited number of times). You then spawn only as many JVMs as there are task slots available to your job in the cluster, instead of one JVM per task.

    This is a huge performance improvement. In long-running jobs, the time needed to set up a new JVM is a very small percentage of the runtime, so reuse doesn't give you a huge performance boost there.

    Also, for long-running tasks it is good to recreate the task process, because issues like heap fragmentation can degrade performance.

    In addition, if you have jobs with medium-length tasks, you could reuse each JVM for just 2-3 tasks, which is a good trade-off.
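The JVM counts described in this answer can be estimated with a little arithmetic. The following is a minimal Java sketch, not Hadoop code: the class `JvmReuseEstimate`, the method `jvmLaunches`, and the example numbers (1000 tasks, 20 slots) are hypothetical. It assumes an idealized schedule in which tasks are spread evenly across the available slots and each slot runs its tasks one after another, and it ignores failures, speculative execution, and the split between map and reduce slots. The `tasksPerJvm` argument mirrors the value of mapred.job.reuse.jvm.num.tasks, where 1 means no reuse and -1 means unlimited reuse.

```java
final class JvmReuseEstimate {
    /**
     * Estimated number of JVM launches for a job under an idealized,
     * evenly balanced schedule (an assumption of this sketch).
     *
     * @param tasks       total number of tasks in the job
     * @param slots       task slots available to the job
     * @param tasksPerJvm tasks each JVM may run; -1 means unlimited
     */
    static long jvmLaunches(long tasks, long slots, int tasksPerJvm) {
        if (tasks < 0 || slots <= 0 || tasksPerJvm == 0 || tasksPerJvm < -1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        if (tasks == 0) {
            return 0;
        }
        long usedSlots = Math.min(tasks, slots);
        if (tasksPerJvm == -1) {
            // Unlimited reuse: one JVM per slot that actually gets work.
            return usedSlots;
        }
        long base = tasks / usedSlots;   // tasks every used slot receives
        long extra = tasks % usedSlots;  // slots that receive one more task
        long launchesBase = ceilDiv(base, tasksPerJvm);
        long launchesExtra = ceilDiv(base + 1, tasksPerJvm);
        return extra * launchesExtra + (usedSlots - extra) * launchesBase;
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        System.out.println(jvmLaunches(1000, 20, 1));  // no reuse: 1000
        System.out.println(jvmLaunches(1000, 20, -1)); // unlimited reuse: 20
        System.out.println(jvmLaunches(1000, 20, 3));  // 3 tasks per JVM: 340
    }
}
```

With these made-up numbers, unlimited reuse cuts the launches from 1000 to 20, while a limit of 3 tasks per JVM lands in between at 340. Whether that saving matters depends on how long each task runs compared to the JVM startup cost.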
