Question
I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?
As per my understanding, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?
Answer 1:
With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper and reducer slots in the cluster, etc.), we can't suggest specific tips.
But there are some general guidelines to improve the performance.
- If each task takes less than 30-40 seconds, reduce the number of tasks
- If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller
- As long as each task runs for at least 30-40 seconds, increase the number of map tasks to some multiple of the number of mapper slots in the cluster
- The number of reduce tasks per job should be equal to or a bit less than the number of reduce slots in the cluster
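As a concrete illustration, the task counts above can be controlled through standard Hadoop 2.x properties, either cluster-wide in mapred-site.xml or per job. The values below are placeholders to adapt to your own cluster, not recommendations:

```xml
<!-- mapred-site.xml: illustrative values only -->
<configuration>
  <property>
    <!-- Number of reduce tasks; set at or slightly below the
         cluster's reduce capacity -->
    <name>mapreduce.job.reduces</name>
    <value>8</value>
  </property>
  <property>
    <!-- Raise the minimum split size so fewer, longer-running
         map tasks are created (268435456 bytes = 256 MB) -->
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>268435456</value>
  </property>
</configuration>
```

The same properties can be passed per job on the command line with -D, e.g. -Dmapreduce.job.reduces=8.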
Some more tips:
- Configure the cluster properly, with the right diagnostic tools
- Use compression when writing intermediate data to disk
- Tune the number of map and reduce tasks as per the tips above
- Incorporate a combiner wherever it is appropriate
- Use the most appropriate data types for the output (do not use LongWritable when the range of output values fits in an integer; IntWritable is the right choice in that case)
- Reuse Writable objects
- Use the right profiling tools
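The combiner tip is worth a small sketch. A combiner runs reducer-style aggregation locally on each mapper's output, so far fewer records are shuffled to the single reducer. The plain-Java program below (no Hadoop dependency; the class and method names are hypothetical, chosen to mirror a word-count job) shows the effect on record counts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a word-count job: shows how a combiner shrinks
// one mapper's output before it is shuffled to the reducer.
public class CombinerSketch {

    // Map phase: emit one (word, 1) pair per occurrence,
    // as a WordCount mapper would.
    static List<Map.Entry<String, Integer>> map(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(Map.entry(w, 1));
        }
        return pairs;
    }

    // Combiner: reducer-style local aggregation on one mapper's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] split = {"hadoop", "job", "hadoop", "hadoop", "job"};
        List<Map.Entry<String, Integer>> mapOut = map(split);
        Map<String, Integer> combined = combine(mapOut);
        // Without the combiner, 5 records are shuffled; with it, only 2.
        System.out.println(mapOut.size() + " -> " + combined.size());
    }
}
```

With 100 mappers all funneling into 1 reducer, this kind of local aggregation directly cuts shuffle traffic and the load on the lone reducer. Note that a combiner is only safe when the reduce function is commutative and associative, as summation is here.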
Have a look at the Cloudera article on MapReduce performance tips for more.
Source: https://stackoverflow.com/questions/34241198/tips-to-improve-mapreduce-job-performance-in-hadoop