Is there any memory loss in HDFS if we use small files?

醉梦人生 2021-01-03 15:42

I have taken the quote below from Hadoop: The Definitive Guide:

"Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB."

If that is true, is any memory wasted (on disk or on the NameNode) when HDFS stores a large number of small files?

3 Answers
  • 2021-01-03 16:19

    NameNode Memory Usage:

    Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each of these objects occupies roughly 150 to 200 bytes. This is why HDFS prefers a small number of large files: fewer objects means less metadata for the NameNode to hold in memory.

    Consider 1 GB of data with the default block size of 64 MB.

    -Stored as a single 1 GB file
      File names: 1
      Blocks: 1 GB / 64 MB = 16
      Total items = 16 * 3 (replication factor = 3) + 1 (file name) = 49
      Total NameNode memory: 150 * 49 bytes

    -Stored as 1000 individual 1 MB files
      File names: 1000
      Blocks: 1000 (one per file)
      Total items = 1000 * 3 (replication factor = 3) + 1000 (file names) = 4000
      Total NameNode memory: 150 * 4000 bytes
    

    The results above show that a large number of small files is an overhead on the NameNode, because it consumes more NameNode memory. A block name/block ID is a unique identifier for a particular block of data; this ID is used to locate the block when a client makes a read request, so a block cannot be shared between files. The estimate above can be reproduced with the short sketch below.
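
    Below is a minimal sketch of that back-of-the-envelope estimate, assuming 150 bytes per NameNode object and the same per-replica counting used above (the numbers are illustrative, not an exact model of NameNode internals):

      public class NameNodeMemoryEstimate {

          // Rough size of one NameNode object (file or block entry), as assumed above.
          static final long BYTES_PER_OBJECT = 150;

          // Estimate NameNode memory, following the per-replica counting used in this answer.
          static long estimateBytes(long files, long blocksPerFile, int replication) {
              long blockItems = files * blocksPerFile * replication; // block entries
              long fileItems = files;                                // file name entries
              return (blockItems + fileItems) * BYTES_PER_OBJECT;
          }

          public static void main(String[] args) {
              // 1 GB stored as a single file with 64 MB blocks -> 16 blocks.
              System.out.println("single 1 GB file : " + estimateBytes(1, 16, 3) + " bytes");   // 150 * 49
              // The same 1 GB stored as 1000 files of 1 MB -> 1 block each.
              System.out.println("1000 x 1 MB files: " + estimateBytes(1000, 1, 3) + " bytes"); // 150 * 4000
          }
      }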

    HDFS is designed to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead.

    Each request has to be processed by the NameNode to figure out where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, greatly reducing the overhead and the load on the NameNode.

    With this in mind, Hadoop recommends a large block size. The request counts are reproduced in the sketch below.
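
    A tiny sketch of that request-count arithmetic (the 1000 MB file and the 4 KB / 64 MB block sizes are just the illustrative values used above):

      public class BlockRequestCount {

          // Number of block requests needed to read a file: one request per block, rounded up.
          static long requests(long fileSizeBytes, long blockSizeBytes) {
              return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
          }

          public static void main(String[] args) {
              long fileSize = 1000L * 1024 * 1024; // a 1000 MB file
              System.out.println("4 KB blocks : " + requests(fileSize, 4L * 1024));         // 256,000 requests
              System.out.println("64 MB blocks: " + requests(fileSize, 64L * 1024 * 1024)); // 16 requests
          }
      }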

    The HDFS block size is a logical unit for splitting a large file into smaller chunks; each chunk is called a block. These blocks are the units used for further parallel processing of the data, e.g. by MapReduce or any other model that reads or processes data within HDFS.

    If a file is small enough to fit into this logical block, one block is assigned to the file, and it takes disk space according to the actual file size and the Unix file system you are using. The details of how a file is stored on disk are available in this link.

    HDFS block size vs. actual file size

    Since the HDFS block size is a logical unit, not a physical unit of storage, there is no waste of disk space. This can be seen through the FileSystem API, as in the sketch below.
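
    A minimal sketch of how to observe this through the Hadoop FileSystem API, assuming a small file already exists at the hypothetical path /data/small.txt:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockSizeVsFileSize {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Hypothetical path; point it at a real small file in your cluster.
              FileStatus status = fs.getFileStatus(new Path("/data/small.txt"));

              // The block size is only a logical splitting unit ...
              System.out.println("block size  : " + status.getBlockSize() + " bytes"); // e.g. 134217728 (128 MB)
              // ... while the file occupies disk space according to its real length.
              System.out.println("file length : " + status.getLen() + " bytes");       // e.g. 1048576 (1 MB)
              System.out.println("replication : " + status.getReplication());
          }
      }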

    These links will be useful to understand the problem with small files.

    Link1, Link2

  • 2021-01-03 16:39
    1. A 1 MB file stored with a 128 MB block size and a replication factor of 3 is written as 3 block replicas and uses only 3*1 = 3 MB of disk, not 3*128 = 384 MB. The 128 MB shown as the block size is just an abstraction used for the metadata in the NameNode, not the actual storage consumed.

    2. There is no way to store more than one file in a single block; each file is stored in its own block(s). The space actually consumed can be verified with the sketch below.
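
    A minimal sketch, assuming a hypothetical 1 MB file at /data/small.txt with replication 3, showing that the space consumed tracks file length times replication rather than block size:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.ContentSummary;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SpaceConsumed {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Hypothetical path; replace with a real small file in your cluster.
              ContentSummary summary = fs.getContentSummary(new Path("/data/small.txt"));

              System.out.println("logical length : " + summary.getLength() + " bytes");        // ~1 MB
              // Space consumed counts every replica, so ~3 MB here, not 3 * 128 MB.
              System.out.println("space consumed : " + summary.getSpaceConsumed() + " bytes");
          }
      }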

    Reference:

    1. https://stackoverflow.com/a/21274388/3496666
    2. https://stackoverflow.com/a/15065274/3496666
    3. https://stackoverflow.com/a/14109147/3496666
  • 2021-01-03 16:45
    1. See Kumar's Answer
    2. You could look into SequenceFiles or HAR files, depending on your use case. HAR files are analogous to the tar command, and MapReduce can act on the files inside a HAR with a little overhead. SequenceFiles, on the other hand, are essentially a container of key/value pairs; the benefit is that a map task can act on each of those pairs. A sketch of packing small files into a SequenceFile follows the links below.

    HAR Files

    Sequence Files

    More About Sequence Files
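
    A minimal sketch of packing a few local small files into one SequenceFile keyed by file name (the input paths and output location are hypothetical; adjust them to your environment):

      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class PackSmallFiles {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();

              // Hypothetical inputs and output.
              String[] smallFiles = {"/tmp/a.txt", "/tmp/b.txt", "/tmp/c.txt"};
              Path output = new Path("/data/packed.seq");

              SequenceFile.Writer writer = null;
              try {
                  writer = SequenceFile.createWriter(conf,
                          SequenceFile.Writer.file(output),
                          SequenceFile.Writer.keyClass(Text.class),
                          SequenceFile.Writer.valueClass(BytesWritable.class));

                  // Key = original file name, value = raw file contents.
                  for (String f : smallFiles) {
                      byte[] contents = Files.readAllBytes(Paths.get(f));
                      writer.append(new Text(f), new BytesWritable(contents));
                  }
              } finally {
                  IOUtils.closeStream(writer); // flushes data and sync markers
              }
          }
      }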
