Hadoop input split for a compressed block

Submitted by 不打扰是莪最后的温柔 on 2019-12-10 17:24:54

Question


If I have a 1 GB compressed file that is splittable, and the default block size and input split size is 128 MB, then 8 blocks and 8 input splits are created. When a compressed block is read by MapReduce it is uncompressed, and say after decompression the size of the block becomes 200 MB. But the input split assigned to it is 128 MB, so how is the remaining 72 MB processed?

  1. Is it processed by the next input split?
  2. Is the size of the same input split increased?
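The split arithmetic the question assumes can be sketched as follows. This is a simplified, self-contained sketch of the rule `FileInputFormat` uses (`splitSize = max(minSize, min(maxSize, blockSize))`), not the actual Hadoop code; the class and method names are illustrative.

```java
public class SplitMath {
    // Simplified form of FileInputFormat's split-size rule:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file, using ceiling division.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long fileSize = 1024 * MB;   // 1 GB compressed file, as in the question
        long blockSize = 128 * MB;   // default HDFS block size
        long split = splitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(numSplits(fileSize, split)); // prints 8
    }
}
```

Note that the split covers a range of *compressed* bytes; the 200 MB figure only exists after the codec decompresses that range inside the mapper.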

Answer 1:


Here is my understanding:

Let's assume 1 GB of compressed data = 2 GB of decompressed data, so you have 16 blocks of data. Bzip2 knows the block boundaries, because a bzip2 file provides a synchronization marker between blocks. So the data is split into 16 splits and sent to 16 mappers. Each mapper gets a decompressed data size of one input split, i.e. 128 MB. (Of course, if the data is not an exact multiple of 128 MB, the last mapper will get less data.)
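The arithmetic in this answer can be checked as a sketch (the 2 GB decompressed size is the answer's own assumption, not a property of bzip2):

```java
public class MapperLoad {
    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long decompressed = 2048 * MB; // assumed: 1 GB compressed -> 2 GB decompressed
        long splitSize = 128 * MB;     // default input split size

        long mappers = decompressed / splitSize;
        System.out.println(mappers);                        // prints 16
        System.out.println((decompressed / mappers) / MB);  // prints 128 (MB per mapper)
    }
}
```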




Answer 2:


I am referring here to compressed formats that are splittable, like bzip2. If an input split is created for a 128 MB block of bzip2 data, and during MapReduce processing it is uncompressed to 200 MB, what happens?




Answer 3:


Total file size: 1 GB

Block size: 128 MB

Number of splits: 8

Creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way. For this reason, gzip does not support splitting.

MapReduce does not split the gzipped file, since it knows (by looking at the filename extension) that the input is gzip-compressed and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process all 8 HDFS blocks, most of which will not be local to the map.
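The decision described above can be sketched as a simplified, non-Hadoop simulation: a non-splittable codec (gzip) yields one split for the whole file, while a splittable one (bzip2) yields roughly one split per block. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanner {
    // Each split is a pair {offset, length} over the compressed file.
    static List<long[]> getSplits(long fileSize, long blockSize, boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // gzip case: one split, one mapper reads all blocks (poor locality)
            splits.add(new long[]{0, fileSize});
            return splits;
        }
        // bzip2 case: one split per block
        for (long off = 0; off < fileSize; off += blockSize) {
            splits.add(new long[]{off, Math.min(blockSize, fileSize - off)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        System.out.println(getSplits(1024 * MB, 128 * MB, false).size()); // prints 1
        System.out.println(getSplits(1024 * MB, 128 * MB, true).size());  // prints 8
    }
}
```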

Have a look at this article, in the section named "Issues about compression and input split".

EDIT (for splittable compression):

BZip2 is a compression/decompression algorithm that compresses blocks of data, and these compressed blocks can later be decompressed independently of each other. This is indeed an opportunity: instead of one BZip2-compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block should be processed by only one mapper, and ultimately all the blocks of the file should be processed. (By processing we mean the actual use of the uncompressed data, coming out of the codec, in a mapper.)
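The "each block processed by exactly one mapper" rule can be sketched with a simple ownership convention: a compressed block belongs to the split whose byte range contains the block's first byte (located in practice via the bzip2 synchronization marker). This is a hypothetical illustration, not Hadoop's actual record-reader code, and the block offsets are made up.

```java
public class BlockOwnership {
    // A block starting at offset `off` is owned by the split
    // whose range [splitStart, splitEnd) contains `off`.
    static int blocksOwned(long splitStart, long splitEnd, long[] blockStarts) {
        int owned = 0;
        for (long off : blockStarts) {
            if (off >= splitStart && off < splitEnd) owned++;
        }
        return owned;
    }

    public static void main(String[] args) {
        // Illustrative compressed-block start offsets, not real bzip2 data.
        long[] blockStarts = {0, 90, 210, 300, 420};
        int total = 0;
        for (long s = 0; s < 500; s += 128) {
            total += blocksOwned(s, s + 128, blockStarts);
        }
        System.out.println(total); // prints 5: every block counted exactly once
    }
}
```

Because ranges of the splits are disjoint and cover the file, each block start falls in exactly one split, which is the correctness criterion the answer states.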

Source: https://issues.apache.org/jira/browse/HADOOP-4012



Source: https://stackoverflow.com/questions/33331366/hadoop-input-split-for-a-compressed-block
