I know a bit about Hadoop and I am curious to know how it works.
To be precise, I want to know how exactly it divides/splits the input file.
Does it split the file into chunks of a fixed size, or is this configurable?
This is dependent on the InputFormat, which for most file-based formats is defined in the FileInputFormat
base class.
There are a number of configurable options that control how Hadoop will take a single file and either process it as a single split or divide the file into multiple splits:

- FileInputFormat.isSplitable() - check the implementation for your input format for more information.
- mapred.min.split.size and mapred.max.split.size, which help the input format when breaking up blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).

If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (both the new and old APIs have the same method, but they may have some subtle differences).
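For example, a minimal sketch of a custom input format that forces one split per file (the class name is made up, and the newer org.apache.hadoop.mapreduce API is assumed):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical example: because isSplitable() returns false,
    // FileInputFormat.getSplits() emits exactly one split per input file,
    // no matter how many HDFS blocks the file spans.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split this format's files
        }
    }

The old org.apache.hadoop.mapred API has the same hook, but it takes a FileSystem and a Path instead of a JobContext.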
When you submit a map-reduce job (or a Pig/Hive job), Hadoop first calculates the input splits; the size of each input split generally equals the HDFS block size. For example, for a file of 1 GB there will be 16 input splits if the block size is 64 MB. However, the split size can be configured to be smaller or larger than the HDFS block size. The calculation of input splits is done by FileInputFormat, and for each of these input splits a map task is started.
But you can change the size of the input split by configuring the following properties (a short driver sketch follows the list):
mapred.min.split.size: The minimum size chunk that map input should be split into.
mapred.max.split.size: The largest valid size in bytes for a file split.
dfs.block.size: The default block size for new files.
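For example, a minimal driver sketch that sets these limits programmatically (class and job names are made up; the newer org.apache.hadoop.mapreduce API is assumed, where the corresponding property names are mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Ask for splits no smaller than 32 MB and no larger than 64 MB,
            // regardless of dfs.block.size (the last split of a file may still be smaller).
            FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

            // ... set mapper/reducer classes and input/output paths,
            // then submit with job.waitForCompletion(true).
        }
    }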
And the formula for the input split size is:
splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize))
where minSplitSize and maxSplitSize come from mapred.min.split.size and mapred.max.split.size, and blockSize comes from dfs.block.size.