What is a sequence file in Hadoop?

悲&欢浪女 2020-12-23 22:40

I am new to MapReduce and I want to understand what sequence file data input is. I studied the Hadoop book, but it was hard for me to understand.

1 Answer
  • 2020-12-23 23:18

    First we should understand what problems SequenceFile tries to solve, and then how SequenceFile helps to solve them.

    In HDFS

    • SequenceFile is one of the solutions to small file problem in Hadoop.
    • A small file is one that is significantly smaller than the HDFS block size (128 MB by default).
    • Each file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
    • 10 million files, each using a block, would use about 3 GB of NameNode memory.
    • Scaling up to a billion files is not feasible.
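    The memory figures above can be checked with quick arithmetic (a sketch; 150 bytes per object is an approximation, and each small file is assumed to occupy exactly one block):

    ```python
    # Approximate NameNode memory consumed by many small files.
    # Assumption: each small file fits in one block, so it costs
    # one file object plus one block object in NameNode memory.
    BYTES_PER_OBJECT = 150       # rough per-object overhead
    OBJECTS_PER_SMALL_FILE = 2   # 1 file inode + 1 block

    def namenode_memory_bytes(num_files):
        return num_files * OBJECTS_PER_SMALL_FILE * BYTES_PER_OBJECT

    usage = namenode_memory_bytes(10_000_000)
    print(usage / 10**9)  # 3.0 -> about 3 GB, matching the figure above
    ```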

    In MapReduce

    • Map tasks usually process a block of input at a time (using the default FileInputFormat).

    • The more files there are, the more map tasks are needed, and the job can run much more slowly.

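    Concretely, with the default FileInputFormat each file yields at least one input split, and hence at least one map task (a rough sketch; it assumes no splits are combined):

    ```python
    import math

    BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB

    def map_tasks(file_sizes):
        # Each file produces at least one split; a large file is
        # split roughly per HDFS block.
        return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

    # 10,000 files of 100 KB each: one map task per file
    print(map_tasks([100 * 1024] * 10_000))   # 10000
    # The same 1 GB of data in a single file: only 8 map tasks
    print(map_tasks([1024 * 1024 * 1024]))    # 8
    ```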
    Small file scenarios

    • The files are pieces of a larger logical file.
    • The files are inherently small, for example, images.

    These two cases require different solutions.

    • For the first case, write a program to concatenate the small files together (see Nathan Marz’s post about a tool called the Consolidator, which does exactly this).
    • For the second case, some kind of container is needed to group the files in some way.

    Solutions in Hadoop

    HAR files

    • HAR files (Hadoop Archives) were introduced to alleviate the problem of many files putting pressure on the NameNode’s memory.
    • HARs are probably best used purely for archival purposes.

    SequenceFile

    • The concept of SequenceFile is to pack many small files into a single larger file.
    • For example, suppose there are 10,000 100 KB files. We can write a program to put them into a single SequenceFile as below, using the filename as the key and the file content as the value.


      (figure: a SequenceFile packing small files, with filename as key and file content as value; source: csdn.net)
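    In Hadoop itself this packing is done with the Java SequenceFile.Writer API. Purely to illustrate the key–value container idea, here is a minimal Python sketch (this is not the real SequenceFile binary format):

    ```python
    import io
    import struct

    def pack(files):
        """Pack a {filename: content} mapping into one container:
        for each entry write key length, key bytes, value length, value bytes."""
        buf = io.BytesIO()
        for name, content in files.items():
            key = name.encode("utf-8")
            buf.write(struct.pack(">I", len(key)))
            buf.write(key)
            buf.write(struct.pack(">I", len(content)))
            buf.write(content)
        return buf.getvalue()

    def unpack(blob):
        """Recover the {filename: content} mapping from the container."""
        files, pos = {}, 0
        while pos < len(blob):
            (klen,) = struct.unpack_from(">I", blob, pos); pos += 4
            name = blob[pos:pos + klen].decode("utf-8"); pos += klen
            (vlen,) = struct.unpack_from(">I", blob, pos); pos += 4
            files[name] = blob[pos:pos + vlen]; pos += vlen
        return files

    small_files = {"a.img": b"\x00" * 10, "b.img": b"\x01" * 20}
    assert unpack(pack(small_files)) == small_files
    ```

    The real format adds a header, sync markers (which make the file splittable), and optional compression, but the key–value layout is the same idea.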

    • Some benefits:

      1. Less memory is needed on the NameNode. Continuing with the 10,000 100 KB files example,
        • Before using SequenceFile, the 10,000 files' objects occupy about 4.5 MB of RAM in the NameNode.
        • After using SequenceFile, a single 1 GB SequenceFile with 8 HDFS blocks, these objects occupy about 3.6 KB of RAM in the NameNode.
      2. SequenceFile is splittable, so it is suitable for MapReduce.
      3. SequenceFile supports compression.
    • Supported compression types; the file structure depends on the compression type.

      1. Uncompressed
      2. Record-Compressed: Compresses each record as it’s added to the file.
        (figure: record-compressed SequenceFile structure; source: csdn.net)

      3. Block-Compressed
        (figure: block-compressed SequenceFile structure; source: csdn.net)

        • Waits until the buffered data reaches the block size before compressing.
        • Block compression provides a better compression ratio than record compression.
        • Block compression is generally the preferred option when using SequenceFile.
        • “Block” here is unrelated to HDFS or filesystem blocks.
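    The claim that block compression achieves a better ratio can be demonstrated with a small sketch (using zlib as a stand-in codec; the real SequenceFile codecs and framing differ):

    ```python
    import zlib

    # 1000 small "records" with lots of redundancy across records
    records = [b"user=%d level=INFO msg=heartbeat ok" % i for i in range(1000)]

    # Record compression: compress each record on its own, so the codec
    # cannot exploit similarity between records.
    record_compressed = sum(len(zlib.compress(r)) for r in records)

    # Block compression: buffer many records and compress them together,
    # letting the codec exploit cross-record redundancy.
    block_compressed = len(zlib.compress(b"".join(records)))

    print(record_compressed, block_compressed)
    assert block_compressed < record_compressed
    ```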