This might seem like a silly question, but suppose the Hadoop block size is X (typically 64 or 128 MB) and a local file has size Y (where Y is less than X). Now when I copy file Y to th
Hadoop consumes one block for the file. That does not mean that an equivalent amount of storage capacity is consumed.
The listing shown when browsing HDFS from the web UI looks like this (columns: name, type, size, replication, block size, modification time):
filename1 file 48.11 KB 3 128 MB 2012-04-24 18:36
filename2 file 533.24 KB 3 128 MB 2012-04-24 18:36
filename3 file 303.65 KB 3 128 MB 2012-04-24 18:37
You can see that each file is smaller than the 128 MB block size; the files are only kilobytes in size. HDFS capacity is consumed based on the actual file size, but each file still consumes one block.
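To make the difference concrete, here is a minimal sketch of the arithmetic for the three files in the listing above. The helper names (`blocks_needed`, `disk_bytes`) are made up for illustration and are not part of any Hadoop API; the sketch also assumes raw disk usage is simply the file size times the replication factor and ignores checksum and metadata overhead.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as shown in the listing
REPLICATION = 3                 # replication factor shown in the listing

# File sizes from the listing, converted from KB to bytes.
files = {
    "filename1": 48.11 * 1024,
    "filename2": 533.24 * 1024,
    "filename3": 303.65 * 1024,
}

def blocks_needed(size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies: one per started block_size chunk."""
    return math.ceil(size_bytes / block_size)

def disk_bytes(size_bytes, replication=REPLICATION):
    """Approximate raw disk usage: actual bytes times replication (assumption)."""
    return size_bytes * replication

total_blocks = sum(blocks_needed(s) for s in files.values())
total_disk = sum(disk_bytes(s) for s in files.values())

# What the usage would be if every block were padded to the full block size.
padded_disk = total_blocks * BLOCK_SIZE * REPLICATION

print(f"blocks consumed: {total_blocks}")
print(f"approx. disk used: {total_disk / 1024 / 1024:.2f} MB")
print(f"if blocks were padded: {padded_disk / 1024 / 1024:.0f} MB")
```

The three files take three blocks but only a few megabytes of disk across all replicas, far less than the 1152 MB that three fully padded, triple-replicated 128 MB blocks would need.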
There is a limited number of blocks available, depending on the capacity of HDFS. By storing small files you waste blocks, because you will run out of them before using all of the actual storage capacity. Remember that a Unix filesystem also has the concept of a block size, but there it is very small, typically between 512 bytes and a few kilobytes. HDFS inverts this by keeping the block size much larger, around 64-128 MB.
The other issue is that a map/reduce job tries to spawn one mapper per block, so when you process three small files it may end up spawning three mappers to work on them. This wastes resources when the files are small. It also adds latency, because each mapper takes time to spawn and then works on only a very small file. Compact the small files into files closer to the block size so that the mappers work on fewer, larger files.
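A rough way to see the effect is to count the mappers under the simple model described above. This is only a sketch: `estimated_mappers` is a hypothetical helper, and it assumes one mapper per block with no split ever spanning two files, which is the usual default but depends on the input format and job configuration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def estimated_mappers(file_sizes, block_size=BLOCK_SIZE):
    """Estimate map tasks as one per block of each file (assumed model)."""
    return sum(math.ceil(size / block_size) for size in file_sizes)

# The three small files from the listing, in bytes.
small_files = [48.11 * 1024, 533.24 * 1024, 303.65 * 1024]
# The same data compacted into a single file.
compacted = [sum(small_files)]

print(estimated_mappers(small_files))  # 3
print(estimated_mappers(compacted))    # 1

# A larger hypothetical case: 10,000 files of 1 MB each.
many_small = [1024 * 1024] * 10_000
many_compacted = [sum(many_small)]
print(estimated_mappers(many_small))      # 10000
print(estimated_mappers(many_compacted))  # 79
```

Under this model the compacted input needs far fewer mappers for the same amount of data, which is the point of merging small files before processing.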
Yet another issue with numerous small files is that they load the NameNode, which keeps the metadata (the mapping) for every file and block in main memory. With small files this table fills up faster, and more main memory is required as the metadata grows.
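The memory pressure can be estimated with a back-of-the-envelope calculation. The sketch below assumes roughly 150 bytes of NameNode heap per object (file or block), a commonly quoted rule of thumb rather than a measured figure; the real cost varies by Hadoop version and configuration. `namenode_bytes` is a hypothetical helper, not a Hadoop function.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB
MB = 1024 * 1024
TB = 1024 * 1024 * MB

def namenode_bytes(num_files, file_size,
                   block_size=BLOCK_SIZE, per_object=BYTES_PER_OBJECT):
    """Estimate NameNode memory: one object per file plus one per block."""
    blocks_per_file = math.ceil(file_size / block_size)
    objects = num_files * (1 + blocks_per_file)
    return objects * per_object

# The same 1 TB of data stored two ways.
as_small_files = namenode_bytes(TB // MB, MB)                  # 1 MB files
as_block_sized = namenode_bytes(TB // BLOCK_SIZE, BLOCK_SIZE)  # 128 MB files

print(f"1 MB files:   {as_small_files / MB:.1f} MB of metadata")
print(f"128 MB files: {as_block_sized / MB:.1f} MB of metadata")
```

With these assumptions, storing the data as 1 MB files needs 128 times as much NameNode memory as storing it as block-sized files, because the number of file and block objects grows by that factor.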
Read the following for reference: