When I store many small files into HDFS, will they get stored in a single block?
In my opinion, each of these small files will get stored in its own block, because a block belongs to only one file. You can verify this as follows: 1. Use the fsck command to get the block info of the file:
hadoop fsck /gavial/data/OB/AIR/PM25/201709/01/15_00.json -files -blocks
The output looks like this:
/gavial/data/OB/AIR/PM25/201709/01/15_00.json 521340 bytes, 1 block(s): OK
0. BP-1004679263-192.168.130.151-1485326068364:blk_1074920015_1179253 len=521340 repl=3
Status: HEALTHY
Total size: 521340 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 521340 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
The block id is:
blk_1074920015
2. Use the fsck command to show the block status; the output looks like this:
hdfs fsck -blockId blk_1074920015
Block Id: blk_1074920015
Block belongs to: /gavial/data/OB/AIR/PM25/201709/01/15_00.json
No. of Expected Replica: 3
No. of live Replica: 3
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommission Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-5/default-rack is HEALTHY
Block replica on datanode/rack: datanode-1/default-rack is HEALTHY
Obviously, the block belongs to only one file.
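To double-check, you can put a couple of extra small files into HDFS and list their blocks in one go; each file shows up with its own block id. The file names and target directory below are just placeholders:

hdfs dfs -put a.json b.json /gavial/data/tmp/
hdfs fsck /gavial/data/tmp -files -blocks | grep blk_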
Quoting from Hadoop: The Definitive Guide:
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
Conclusion: Each file will get stored in a separate block.
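Since the quote above mentions Hadoop Archives, here is a rough sketch of packing a directory of small files into a HAR file and listing it back transparently. The archive name and destination path are just examples, and note that the archive command runs a MapReduce job:

hadoop archive -archiveName pm25.har -p /gavial/data/OB/AIR PM25 /gavial/archives
hdfs dfs -ls -R har:///gavial/archives/pm25.har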
Yes. When you store a large number of small files, they get stored in a single block until the block no longer has enough space to accommodate them. But the inefficiency comes because, for each of these small files, an indexing entry (filename, block, offset) gets created in the namenode. This uses up the memory reserved for metadata in the namenode if we have many small files instead of a small number of very large files.
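To put rough numbers on the metadata cost: a rule of thumb often quoted from the Definitive Guide is that each file, directory, and block takes roughly 150 bytes of namenode memory. So one million 1 MB files means about one million file entries plus one million block entries, roughly 2,000,000 × 150 bytes ≈ 300 MB of namenode heap, whereas the same terabyte of data stored as a handful of large files in 128 MB blocks would need only on the order of 8,000 block entries.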
Below is what is specified in Hadoop: The Definitive Guide:
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
For example, if you have a 30 MB file and your block size is 64 MB, then this file will get stored in one block logically, but in the physical file system, HDFS uses only 30 MB to store the file. The remaining 34 MB will be free to use.
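You can check this on your own cluster: -du reports the file's real length rather than a full block, and getconf shows the configured block size. The path below is the same example file from the fsck output above:

hdfs dfs -du -h /gavial/data/OB/AIR/PM25/201709/01/15_00.json
hdfs getconf -confKey dfs.blocksize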