Increase number of Hive mappers in Hadoop 2

北荒 2021-02-02 02:56

I created an HBase table from Hive and I'm trying to do a simple aggregation on it. This is my Hive query:

from my_hbase_table 
select col1, count(1) 
group by c         


        
3 answers
  •  再見小時候
    2021-02-02 03:11

    Splitting the input into pieces smaller than the default size is not an efficient solution. Splitting is mainly useful when dealing with large datasets. The default size is already small, so it is not worth splitting it further.

    I would recommend setting the following configuration before your query. Adjust the values to your input data.

    set hive.merge.mapfiles=false;
    
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    
    set mapred.map.tasks = XX;
    

    If you also want to set the number of reducers, you can use the configuration below:

    set mapred.reduce.tasks = XX;
    

    Note that on Hadoop 2 (YARN), mapred.map.tasks and mapred.reduce.tasks are deprecated and have been replaced by other properties:

    mapred.map.tasks     -->    mapreduce.job.maps
    mapred.reduce.tasks  -->    mapreduce.job.reduces
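
    As a small illustration of the renaming, the sketch below rewrites Hive `set` statements that use the deprecated names. The helper name `modernize` and the regular expression are hypothetical; the three renames are only the ones mentioned in this answer, not a complete list of deprecated Hadoop properties.

```python
import re

# Deprecated name -> Hadoop 2 name, limited to the properties named in this answer.
RENAMES = {
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
    "mapred.max.split.size": "mapreduce.input.fileinputformat.split.maxsize",
}

# Matches statements of the form "set key = value;" (hypothetical helper pattern).
SET_PATTERN = re.compile(r"\s*set\s+([\w.]+)\s*=\s*(.+?)\s*;?\s*$", re.IGNORECASE)


def modernize(statement):
    """Return the statement with a deprecated property name replaced, if any."""
    match = SET_PATTERN.match(statement)
    if not match:
        return statement  # not a "set" statement, leave it untouched
    key, value = match.groups()
    return f"set {RENAMES.get(key, key)}={value};"


if __name__ == "__main__":
    for line in ["set mapred.map.tasks = 20;", "set hive.merge.mapfiles=false;"]:
        print(modernize(line))
```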
    

    Please refer to the useful link below on this topic:

    http://answers.mapr.com/questions/5336/limit-mappers-and-reducers-for-specific-job.html

    Fail to Increase Hive Mapper Tasks?

    How mappers get assigned

    The number of mappers is determined by the number of splits, which are computed by the InputFormat used in the MapReduce job. With a typical InputFormat, it grows with the number of files and with the file sizes.

    Suppose your HDFS block size is configured as 64 MB (the Hadoop 1 default; Hadoop 2 defaults to 128 MB) and you have a file of 100 MB. It will occupy 2 blocks, and 2 mappers will be assigned, one per block.

    But if you have 2 files of 30 MB each, each file will occupy one block, and one mapper will be assigned per file, so you again get 2 mappers.
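
    The arithmetic in these two examples can be sketched in a few lines of Python. This is a simplified model that assumes one split per HDFS block and at least one split per non-empty file; it ignores details such as the slack that FileInputFormat allows on the last split. The function names are hypothetical.

```python
import math

MB = 1024 * 1024


def splits_per_file(file_size, block_size):
    """Approximate splits for one file: one per block, none for an empty file."""
    if file_size <= 0:
        return 0
    return math.ceil(file_size / block_size)


def estimate_mappers(file_sizes, block_size=64 * MB):
    """Approximate mapper count: the sum of the per-file split counts."""
    return sum(splits_per_file(size, block_size) for size in file_sizes)


if __name__ == "__main__":
    print(estimate_mappers([100 * MB]))          # one 100 MB file  -> 2
    print(estimate_mappers([30 * MB, 30 * MB]))  # two 30 MB files  -> 2
```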

    When you are working with a large number of small files, Hive uses CombineHiveInputFormat by default. In MapReduce terms, this ultimately translates to using CombineFileInputFormat, which creates virtual splits over multiple files, grouped by common node or rack when possible. The size of the combined split is determined by:

    mapred.max.split.size
    or
    mapreduce.input.fileinputformat.split.maxsize (in YARN/MR2)
    

    So if you want fewer splits (and therefore fewer mappers), you need to set this parameter higher; to get more splits, set it lower.

    This link can help you understand more about it:
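
    To see how the maximum split size drives the mapper count for many small files, here is a rough Python estimate. It assumes the files are packed into combined splits purely by size and ignores the node and rack grouping that CombineFileInputFormat performs, so real counts can be higher. The function name and the sample file sizes are hypothetical.

```python
import math

MB = 1024 * 1024


def estimate_combined_splits(file_sizes, max_split_size):
    """Rough lower bound on combined splits: total input size / max split size."""
    total = sum(file_sizes)
    if total == 0:
        return 0
    return math.ceil(total / max_split_size)


if __name__ == "__main__":
    small_files = [10 * MB] * 100  # 100 files of 10 MB each, 1000 MB in total
    print(estimate_combined_splits(small_files, 256 * MB))  # larger max size  -> 4
    print(estimate_combined_splits(small_files, 64 * MB))   # smaller max size -> 16
```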

    What is the default size that each Hadoop mapper will read?

    Also, the number of mappers and reducers that actually run concurrently always depends on the available mapper and reducer slots in your cluster.
