Can somebody explain this calculation clearly?
A quick calculation shows that if the seek time is around 10 ms and the transfer r
Since the 100 MB is divided into 10 blocks, you have to do 10 seeks, and the transfer time for each block is 10 MB / (100 MB/s) = 0.1 s. So the total is (10 ms * 10) + (0.1 s * 10) = 1.1 s, which is greater than the 1.01 s (10 ms + 1 s) for a single 100 MB block anyway.
Since the 100 MB is divided among 10 blocks, each block holds only 10 MB, as it is HDFS. Then it should be 10 * 10 ms + 10 MB / (100 MB/s) = 0.1 s + 0.1 s = 0.2 s, which is even less time.
A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.
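To make the arithmetic concrete, here is a minimal sketch of that formula. It uses the figures quoted in this thread (10 ms seek, 100 MB/s transfer) and assumes the blocks are read one after another from a single disk, with one seek per block and no parallelism. The function name `read_time_s` and its parameters are made up for illustration.

```python
def read_time_s(file_mb, n_blocks, seek_s=0.010, rate_mb_s=100.0):
    """Time to read a file stored as n_blocks contiguous blocks.

    Assumes one seek per block and sequential reads from one disk.
    """
    block_mb = file_mb / n_blocks
    transfer_s = block_mb / rate_mb_s      # sizeOfTheBlock / transferRate
    return n_blocks * (seek_s + transfer_s)


one_block = read_time_s(100, 1)     # one 100 MB block
ten_blocks = read_time_s(100, 10)   # ten 10 MB blocks

print(f"1 block:   {one_block:.2f} s")   # 1.01 s
print(f"10 blocks: {ten_blocks:.2f} s")  # 1.10 s
```

The transfer part is 1 s in both cases; only the seek part grows with the number of blocks.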
If we keep the ratio seekTime / transferTime small (close to 0.01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
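The ratio can also be turned around to ask which block size gives a target ratio. The sketch below assumes the same 10 ms seek time and 100 MB/s transfer rate used elsewhere in this thread; `seek_ratio` and `block_size_for_ratio` are hypothetical helper names, not part of any Hadoop API.

```python
def seek_ratio(block_mb, seek_s=0.010, rate_mb_s=100.0):
    """seekTime / transferTime for one block of the given size."""
    transfer_s = block_mb / rate_mb_s
    return seek_s / transfer_s


def block_size_for_ratio(target_ratio, seek_s=0.010, rate_mb_s=100.0):
    """Block size (MB) at which seekTime / transferTime equals target_ratio.

    From seek_s / (block_mb / rate_mb_s) = target_ratio.
    """
    return seek_s * rate_mb_s / target_ratio


for size_mb in (1, 10, 100):
    print(f"{size_mb:>3} MB block: seek/transfer = {seek_ratio(size_mb):.2f}")

print(f"block size for a 1% ratio: {block_size_for_ratio(0.01):.0f} MB")
```

With these numbers a 1 MB block spends as long seeking as transferring, while a block of about 100 MB brings the seek overhead down to roughly 1%.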
This is important because in MapReduce jobs we typically traverse (read) the whole data set (an HDFS file, a folder, or a set of folders) and apply some logic to it. Since we have to spend the full transferTime anyway to get all the data off the disk, we try to minimise the time spent doing seeks by reading in big chunks, hence the large size of the data blocks.
In more traditional disk access software, we typically do not read the whole data set every time, so we would rather do plenty of seeks on smaller blocks than lose time transferring data that we won't need.