Question
I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?
As per my understanding, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?
Answer 1:
With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper and reducer slots in the cluster, etc.), we can't suggest specific tips.
But there are some general guidelines to improve the performance.
- If each task takes less than 30-40 seconds, reduce the number of tasks
- If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller
- As long as each task runs for at least 30-40 seconds, increase the number of map tasks to some multiple of the number of mapper slots in the cluster
- The number of reduce tasks per job should be equal to or a bit less than the number of reduce slots in the cluster
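As a concrete illustration, the task counts above can be controlled through standard Hadoop 2.x properties, either cluster-wide in mapred-site.xml or per job. The values below are placeholders to adapt to your own cluster, not recommendations:

```xml
<!-- mapred-site.xml: illustrative values only -->
<configuration>
  <property>
    <!-- Number of reduce tasks; set at or slightly below the
         cluster's reduce capacity -->
    <name>mapreduce.job.reduces</name>
    <value>8</value>
  </property>
  <property>
    <!-- Raise the minimum split size so fewer, longer-running
         map tasks are created (268435456 bytes = 256 MB) -->
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>268435456</value>
  </property>
</configuration>
```

The same properties can be passed per job on the command line with -D, e.g. -Dmapreduce.job.reduces=8.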
Some more tips:
- Configure the cluster properly, with the right diagnostic tools
- Use compression when writing intermediate data to disk
- Tune the number of map and reduce tasks as per the tips above
- Incorporate a combiner wherever it is appropriate
- Use the most appropriate data types for the output (do not use LongWritable when the range of output values fits in an integer; IntWritable is the right choice in that case)
- Reuse Writable objects
- Use the right profiling tools
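The combiner tip is worth a small sketch. A combiner runs reducer-style aggregation locally on each mapper's output, so far fewer records are shuffled to the single reducer. The plain-Java program below (no Hadoop dependency; the class and method names are hypothetical, chosen to mirror a word-count job) shows the effect on record counts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a word-count job: shows how a combiner shrinks
// one mapper's output before it is shuffled to the reducer.
public class CombinerSketch {

    // Map phase: emit one (word, 1) pair per occurrence,
    // as a WordCount mapper would.
    static List<Map.Entry<String, Integer>> map(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(Map.entry(w, 1));
        }
        return pairs;
    }

    // Combiner: reducer-style local aggregation on one mapper's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] split = {"hadoop", "job", "hadoop", "hadoop", "job"};
        List<Map.Entry<String, Integer>> mapOut = map(split);
        Map<String, Integer> combined = combine(mapOut);
        // Without the combiner, 5 records are shuffled; with it, only 2.
        System.out.println(mapOut.size() + " -> " + combined.size());
    }
}
```

With 100 mappers all funneling into 1 reducer, this kind of local aggregation directly cuts shuffle traffic and the load on the lone reducer. Note that a combiner is only safe when the reduce function is commutative and associative, as summation is here.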
Have a look at the Cloudera article on MapReduce performance tips for more.
Source: https://stackoverflow.com/questions/34241198/tips-to-improve-mapreduce-job-performance-in-hadoop