Tips to improve MapReduce Job performance in Hadoop

Submitted by 时光毁灭记忆、已成空白 on 2019-12-18 07:23:34

Question


I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?

As per my understanding, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?


Answer 1:


With the limited information in this question (input file size, HDFS block size, average map processing time, number of mapper and reducer slots in the cluster, etc.), it is hard to give specific advice.

But there are some general guidelines to improve the performance.

  1. If each task takes less than 30-40 seconds, reduce the number of tasks.
  2. If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.
  3. As long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
  4. The number of reduce tasks per job should be equal to, or a bit less than, the number of reduce slots in the cluster.
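As a rough sketch of how guidelines 2 and 4 translate into job configuration (the property names below are the Hadoop 2.x ones; the values shown are illustrative, not recommendations for any particular cluster):

```xml
<!-- Guideline 2: a 256 MB block size for a large input dataset -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
<!-- Guideline 4: set reducers to (a bit less than) the cluster's reduce slots;
     8 here is just a placeholder for your cluster's actual capacity -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>8</value>
</property>
```

The same values can also be set per job on the command line, e.g. `-D mapreduce.job.reduces=8`.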

Some more tips :

  1. Configure the cluster properly, with the right diagnostic tools.
  2. Use compression when writing intermediate data to disk.
  3. Tune the number of map and reduce tasks as per the tips above.
  4. Incorporate a combiner wherever appropriate.
  5. Use the most appropriate data types for output (do not use LongWritable when the output values fit in the Integer range; IntWritable is the right choice in that case).
  6. Reuse Writables.
  7. Use the right profiling tools.
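For tip 2 (compressing intermediate map output), this is typically switched on in the job configuration; a minimal sketch, assuming the Snappy codec is installed on the cluster:

```xml
<!-- Compress map output before it is spilled to disk and shuffled -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```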

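To see why a combiner (tip 4) helps, here is a plain-Java sketch of what it does, using a word count as the example. The class and method names are illustrative, not part of the Hadoop API; the point is that local pre-aggregation shrinks the number of records crossing the network in the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Raw map output for a word count: one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapOutput(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    // Combiner: sum the counts per word on the map side, so fewer
    // records are written to disk and shuffled to the reducer.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapOutput("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        System.out.println(raw.size());       // 6 records without a combiner
        System.out.println(combined.size());  // 4 records after combining
    }
}
```

In an actual Hadoop job the same effect comes from `job.setCombinerClass(...)`, usually reusing the reducer class, provided the reduce function is commutative and associative (a sum is; an average is not).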
Have a look at this Cloudera article for some more tips.



Source: https://stackoverflow.com/questions/34241198/tips-to-improve-mapreduce-job-performance-in-hadoop
