Tips to improve MapReduce Job performance in Hadoop

Asked by 说谎 on 2020-12-20 09:16

I have 100 mapper and 1 reducer running in a job. How to improve the job performance?

As per my understanding, using a combiner can improve the performance to a great extent.

1 Answer
  • Answered 2020-12-20 09:56

    With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper and reducer slots in the cluster, etc.), we can't suggest specific tips.

    But there are some general guidelines to improve performance.

    1. If each task takes less than 30-40 seconds, reduce the number of tasks.
    2. If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller.
    3. As long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
    4. The number of reduce tasks per job should be equal to, or a bit less than, the number of reduce slots in the cluster.
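
    The guidelines above map onto standard MapReduce configuration properties. As an illustrative sketch only (the values below are placeholders, not recommendations for any particular cluster):

    ```xml
    <!-- Illustrative mapred-site.xml (or per-job) settings; values are examples only -->
    <property>
      <!-- Raise the minimum split size so large inputs produce fewer, longer-running map tasks -->
      <name>mapreduce.input.fileinputformat.split.minsize</name>
      <value>268435456</value> <!-- 256 MB -->
    </property>
    <property>
      <!-- Set to (or slightly below) the cluster's reduce capacity -->
      <name>mapreduce.job.reduces</name>
      <value>8</value>
    </property>
    <property>
      <!-- Compress intermediate map output before it is spilled and shuffled -->
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    ```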

    Some more tips:

    1. Configure the cluster properly, with the right diagnostic tools.
    2. Use compression when writing intermediate data to disk.
    3. Tune the number of map and reduce tasks as per the tips above.
    4. Incorporate a combiner wherever it is appropriate.
    5. Use the most appropriate data types for the output (do not use LongWritable when the output values fit in the integer range; IntWritable is the right choice in that case).
    6. Reuse Writables.
    7. Use the right profiling tools.

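    To make the combiner tip concrete, here is a minimal plain-Java sketch, with no Hadoop dependencies; the class and method names are invented for illustration. It shows why mapper-side aggregation shrinks the volume of records shuffled to the reducer, which is the whole point of a combiner:

    ```java
    import java.util.*;
    import java.util.stream.*;

    // Illustrative sketch (not the Hadoop API): each "mapper" emits (word, 1)
    // pairs, the combiner pre-sums them locally on the mapper side, so far
    // fewer records cross the network to the reducer.
    public class CombinerDemo {

        // Map phase: one (word, 1) record per token.
        static List<Map.Entry<String, Integer>> map(String line) {
            return Arrays.stream(line.split("\\s+"))
                    .map(w -> Map.entry(w, 1))
                    .collect(Collectors.toList());
        }

        // Combiner: local aggregation of one mapper's output.
        static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
            Map<String, Integer> partial = new HashMap<>();
            for (Map.Entry<String, Integer> r : records) {
                partial.merge(r.getKey(), r.getValue(), Integer::sum);
            }
            return partial;
        }

        // Reducer: merges the partial sums from every mapper.
        static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
            Map<String, Integer> total = new HashMap<>();
            for (Map<String, Integer> p : partials) {
                p.forEach((k, v) -> total.merge(k, v, Integer::sum));
            }
            return total;
        }

        public static void main(String[] args) {
            List<String> splits = List.of("a a b", "a b b b");
            int rawRecords = 0;
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (String s : splits) {
                List<Map.Entry<String, Integer>> mapped = map(s);
                rawRecords += mapped.size();   // records shuffled WITHOUT a combiner
                partials.add(combine(mapped)); // records shuffled WITH a combiner
            }
            int combinedRecords = partials.stream().mapToInt(Map::size).sum();
            System.out.println("without combiner: " + rawRecords + " records shuffled");   // 7
            System.out.println("with combiner:    " + combinedRecords + " records shuffled"); // 4
            System.out.println("final counts: " + new TreeMap<>(reduce(partials)));        // {a=3, b=4}
        }
    }
    ```

    In a real job you would enable this with `job.setCombinerClass(...)`, and it is only safe when the reduce function is commutative and associative (as summing is).
    
    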
    Have a look at this Cloudera article for some more tips.
