I explicitly specify the number of mappers within my Java program using conf.setNumMapTasks(), but when the job ends, the counter shows that the number of launched
According to the Hadoop API, JobConf.setNumMapTasks is just a hint to the Hadoop runtime. The total number of map tasks equals the number of input splits, which by default is roughly the number of blocks in the input data to be processed.
However, you can configure the number of map/reduce slots per node with mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml. This controls how many mappers/reducers execute in parallel across the entire cluster, not how many tasks the job launches in total.
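To make the difference between slots and task count concrete, here is a minimal sketch in plain Java (no Hadoop dependency). The class name SlotMath and the cluster numbers are made up for illustration; it only does the arithmetic implied above and ignores scheduling details such as other jobs competing for slots.

```java
/** Hypothetical helper: relates per-node map slots to cluster-wide parallelism. */
final class SlotMath {

    /** Map tasks that can run at the same time: nodes times per-node map slots. */
    static int clusterMapSlots(int nodes, int mapSlotsPerNode) {
        return nodes * mapSlotsPerNode;
    }

    /** Number of sequential "waves" needed to work through all map tasks. */
    static long waves(long mapTasks, int nodes, int mapSlotsPerNode) {
        long slots = clusterMapSlots(nodes, mapSlotsPerNode);
        return (mapTasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        // Assumed cluster: 10 nodes, mapred.tasktracker.map.tasks.maximum = 4.
        System.out.println(clusterMapSlots(10, 4)); // 40
        System.out.println(waves(1000, 10, 4));     // 25
    }
}
```

Lowering the slot maximum changes the number of waves, not the 1000 map tasks themselves.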
Using conf.setNumMapTasks(int num), the number of mappers can be increased but cannot be reduced.
You cannot explicitly set the number of mappers to a value lower than the number calculated by Hadoop. That number is decided by the number of input splits Hadoop creates for your given input. You can influence it by setting the mapred.min.split.size parameter.
To quote from the wiki page:
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
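The interplay between the hint, the block size, and the minimum split size can be sketched in plain Java. This is a simplified model, not Hadoop's code: it assumes the old-API FileInputFormat rule splitSize = max(minSize, min(goalSize, blockSize)) with goalSize = totalBytes / requestedMaps, treats the input as one large file, and ignores the slack Hadoop allows on the last split. The class name SplitEstimator is made up for illustration.

```java
/** Simplified model of how input splits are sized for a file-based input. */
final class SplitEstimator {

    /** The max(minSize, min(goalSize, blockSize)) rule described in the quote. */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /**
     * Estimates the number of splits, and so map tasks, for one large input.
     * requestedMaps plays the role of mapred.map.tasks / setNumMapTasks().
     */
    static long estimateSplits(long totalBytes, int requestedMaps,
                               long minSplitBytes, long blockBytes) {
        long goalSize = totalBytes / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, Math.max(1L, minSplitBytes), blockBytes);
        return (totalBytes + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long tb = 1L << 40, mb = 1L << 20;
        // A hint below the block count is ignored: the block size caps the split size.
        System.out.println(estimateSplits(10 * tb, 2, 1, 128 * mb));        // 81920
        // A hint above the block count shrinks the splits, so more maps are created.
        System.out.println(estimateSplits(10 * tb, 200_000, 1, 128 * mb));
        // Raising the minimum split size above the block size reduces the map count.
        System.out.println(estimateSplits(10 * tb, 2, 256 * mb, 128 * mb)); // 40960
    }
}
```

The first case reproduces the roughly 82k maps from the quote; the third shows why mapred.min.split.size, rather than setNumMapTasks, is the knob for getting fewer maps.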
Quoting the javadoc of JobConf#setNumMapTasks():
Note: This is only a hint to the framework. The actual number of spawned map tasks depends on the number of InputSplits generated by the job's InputFormat.getSplits(JobConf, int). A custom InputFormat is typically used to accurately control the number of map tasks for the job.
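Hadoop also relaunches failed map tasks and speculatively re-executes slow ones to provide fault tolerance, so the number of launched map tasks reported by the counter can be higher than the number of input splits.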
You can limit the number of map tasks running concurrently on a single node. You can also limit the number of launched tasks, provided that you have big input files: write your own InputFormat class that is not splittable. Hadoop will then run one map task for every input file.
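In the old API this usually means subclassing a FileInputFormat and overriding isSplitable(FileSystem, Path) to return false. The effect on the task count can be sketched without Hadoop; the class name SplittableDemo and the file sizes below are invented for illustration, and the model assumes non-empty files and one split per block when splitting is allowed.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model: map-task count with and without splittable input files. */
final class SplittableDemo {

    /**
     * Counts map tasks for the given non-empty files.
     * Splittable files yield one task per block; non-splittable files yield one task each.
     */
    static long countMapTasks(List<Long> fileSizes, long blockBytes, boolean splittable) {
        long tasks = 0;
        for (long size : fileSizes) {
            tasks += splittable ? (size + blockBytes - 1) / blockBytes : 1;
        }
        return tasks;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // Assumed input: three files of 300 MB, 50 MB and 1 GB, with 128 MB blocks.
        List<Long> files = Arrays.asList(300 * mb, 50 * mb, 1024 * mb);
        System.out.println(countMapTasks(files, 128 * mb, true));  // 12
        System.out.println(countMapTasks(files, 128 * mb, false)); // 3
    }
}
```

The trade-off is that a non-splittable file is read by a single mapper, so very large files lose parallelism and data locality.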
According to [Partitioning your job into maps and reduces]:
The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
You can learn more about InputFormat in its documentation.