What is a sequence file in Hadoop?

悲&欢浪女 2020-12-23 22:40

I am new to MapReduce and I want to understand what sequence file data input is. I studied the Hadoop book, but it was hard for me to understand.

1 Answer
  • 2020-12-23 23:18

    First we should understand what problems SequenceFile tries to solve, and then how SequenceFile helps to solve them.

    In HDFS

    • SequenceFile is one of the solutions to small file problem in Hadoop.
    • A small file is one that is significantly smaller than the HDFS block size (128 MB by default).
    • Each file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
    • 10 million files, each using a block, would use about 3 GB of NameNode memory.
    • Scaling up to a billion files is not feasible.
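    The memory figures above can be checked with quick arithmetic (a sketch; 150 bytes per object is an approximation, and each small file is assumed to occupy exactly one block):

    ```python
    # Approximate NameNode memory consumed by many small files.
    # Assumption: each small file fits in one block, so it costs
    # one file object plus one block object in NameNode memory.
    BYTES_PER_OBJECT = 150       # rough per-object overhead
    OBJECTS_PER_SMALL_FILE = 2   # 1 file inode + 1 block

    def namenode_memory_bytes(num_files):
        return num_files * OBJECTS_PER_SMALL_FILE * BYTES_PER_OBJECT

    usage = namenode_memory_bytes(10_000_000)
    print(usage / 10**9)  # 3.0 -> about 3 GB, matching the figure above
    ```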

    In MapReduce

    • Map tasks usually process a block of input at a time (using the default FileInputFormat).

    • The more files there are, the more map tasks are needed, and the job can run much more slowly.

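    Concretely, with the default FileInputFormat each file yields at least one input split, and hence at least one map task (a rough sketch; it assumes no splits are combined):

    ```python
    import math

    BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB

    def map_tasks(file_sizes):
        # Each file produces at least one split; a large file is
        # split roughly per HDFS block.
        return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

    # 10,000 files of 100 KB each: one map task per file
    print(map_tasks([100 * 1024] * 10_000))   # 10000
    # The same 1 GB of data in a single file: only 8 map tasks
    print(map_tasks([1024 * 1024 * 1024]))    # 8
    ```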
    Small file scenarios

    • The files are pieces of a larger logical file.
    • The files are inherently small, for example, images.

    These two cases require different solutions.

    • For the first case, write a program to concatenate the small files together (see Nathan Marz’s post about a tool called the Consolidator, which does exactly this).
    • For the second case, some kind of container is needed to group the files in some way.

    Solutions in Hadoop

    HAR files

    • HAR files (Hadoop Archives) were introduced to alleviate the problem of many files putting pressure on the NameNode’s memory.
    • HARs are probably best used purely for archival purposes.

    SequenceFile

    • The concept of SequenceFile is to pack many small files into a single larger file.
    • For example, suppose there are 10,000 100 KB files. We can write a program to put them into a single SequenceFile as below, using the filename as the key and the file content as the value.


      (figure: a SequenceFile packing small files, with filename as key and file content as value; source: csdn.net)
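    In Hadoop itself this packing is done with the Java SequenceFile.Writer API. Purely to illustrate the key–value container idea, here is a minimal Python sketch (this is not the real SequenceFile binary format):

    ```python
    import io
    import struct

    def pack(files):
        """Pack a {filename: content} mapping into one container:
        for each entry write key length, key bytes, value length, value bytes."""
        buf = io.BytesIO()
        for name, content in files.items():
            key = name.encode("utf-8")
            buf.write(struct.pack(">I", len(key)))
            buf.write(key)
            buf.write(struct.pack(">I", len(content)))
            buf.write(content)
        return buf.getvalue()

    def unpack(blob):
        """Recover the {filename: content} mapping from the container."""
        files, pos = {}, 0
        while pos < len(blob):
            (klen,) = struct.unpack_from(">I", blob, pos); pos += 4
            name = blob[pos:pos + klen].decode("utf-8"); pos += klen
            (vlen,) = struct.unpack_from(">I", blob, pos); pos += 4
            files[name] = blob[pos:pos + vlen]; pos += vlen
        return files

    small_files = {"a.img": b"\x00" * 10, "b.img": b"\x01" * 20}
    assert unpack(pack(small_files)) == small_files
    ```

    The real format adds a header, sync markers (which make the file splittable), and optional compression, but the key–value layout is the same idea.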

    • Some benefits:

      1. Less memory is needed on the NameNode. Continuing with the 10,000 100 KB files example,
        • Before using SequenceFile, the 10,000 files' objects occupy about 4.5 MB of RAM in the NameNode.
        • After using SequenceFile, a single 1 GB SequenceFile with 8 HDFS blocks, these objects occupy about 3.6 KB of RAM in the NameNode.
      2. SequenceFile is splittable, so it is suitable for MapReduce.
      3. SequenceFile supports compression.
    • Supported compression types; the file structure depends on the compression type.

      1. Uncompressed
      2. Record-Compressed: Compresses each record as it’s added to the file.
        (figure: record-compressed SequenceFile structure; source: csdn.net)

      3. Block-Compressed
        (figure: block-compressed SequenceFile structure; source: csdn.net)

        • Waits until the buffered data reaches the block size before compressing.
        • Block compression provides a better compression ratio than record compression.
        • Block compression is generally the preferred option when using SequenceFile.
        • “Block” here is unrelated to HDFS or filesystem blocks.
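    The claim that block compression achieves a better ratio can be demonstrated with a small sketch (using zlib as a stand-in codec; the real SequenceFile codecs and framing differ):

    ```python
    import zlib

    # 1000 small "records" with lots of redundancy across records
    records = [b"user=%d level=INFO msg=heartbeat ok" % i for i in range(1000)]

    # Record compression: compress each record on its own, so the codec
    # cannot exploit similarity between records.
    record_compressed = sum(len(zlib.compress(r)) for r in records)

    # Block compression: buffer many records and compress them together,
    # letting the codec exploit cross-record redundancy.
    block_compressed = len(zlib.compress(b"".join(records)))

    print(record_compressed, block_compressed)
    assert block_compressed < record_compressed
    ```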