How many mappers and reducers will be created for a partitioned table in Hive

回眸只為那壹抹淺笑 submitted on 2019-11-26 23:07:06
leftjoin

Mappers:

The number of mappers depends on several factors: how the data is distributed among nodes, the input format, the execution engine, and configuration parameters. See also: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

On MapReduce, Hive uses CombineHiveInputFormat by default, while Tez uses grouped splits.

Tez:

set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

MapReduce:

set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB min split
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB max split

Mappers also run on the data nodes where the data is located. That is why manually controlling the number of mappers is not easy, and combining input is not always possible.
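As a rough illustration of how the split-size settings above influence the mapper count, here is a minimal sketch. The 10 GB input size and the 64 MB value are assumptions chosen for the example, not recommendations, and the actual number of mappers also depends on file layout and data locality.

```sql
-- Assumed input: about 10 GB (10737418240 bytes) of data in the selected partitions.

-- Tez: cap each grouped split at 64 MB to get more, smaller map tasks.
set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=67108864; -- 64 MB max split
-- No grouped split should exceed 64 MB, so expect
-- at least about 10737418240 / 67108864 = 160 mappers.

-- MapReduce: the same idea with the combined split size.
set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB min split
set mapreduce.input.fileinputformat.split.maxsize=67108864; -- 64 MB max split
```

Raising the max size has the opposite effect: fewer, larger map tasks. Keep the min size at or below the max size.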

Reducers: Controlling the number of reducers is much easier. The number of reducers is determined by the following settings:

mapreduce.job.reduces - The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. When this property is set to -1, Hive automatically determines the number of reducers.

hive.exec.reducers.bytes.per.reducer - The amount of input data per reducer. The default is 1 GB before Hive 0.14.0 and 256 MB from Hive 0.14.0 onward.

hive.exec.reducers.max - The maximum number of reducers that will be used. If mapreduce.job.reduces is negative, Hive uses this as the upper limit when automatically determining the number of reducers.

So, if you want to increase reducer parallelism, increase hive.exec.reducers.max and decrease hive.exec.reducers.bytes.per.reducer.
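To make the effect concrete, here is a minimal worked sketch. The 10 GB input size and the chosen values are assumptions for illustration only. When mapreduce.job.reduces is -1, Hive's estimate is roughly the input size divided by hive.exec.reducers.bytes.per.reducer, rounded up, at least 1, and capped at hive.exec.reducers.max.

```sql
-- Let Hive estimate the number of reducers.
set mapreduce.job.reduces=-1;

-- Assumed input: about 10 GB (10737418240 bytes).
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB per reducer
set hive.exec.reducers.max=2000;                   -- upper limit
-- Estimate: 10737418240 / 67108864 = 160 reducers (below the 2000 limit).

-- To force an exact number instead, set it explicitly:
-- set mapreduce.job.reduces=50;
```

With a limit of hive.exec.reducers.max=100 and the same input, the estimate would be capped at 100 reducers.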
