Why can't hadoop split up a large text file and then compress the splits using gzip?

独厮守ぢ 2020-12-17 01:01

I've recently been looking into Hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute these chunks around your cluster. Since gzip files can't be split for parallel processing, why can't Hadoop split the large text file itself and then compress each of the splits with gzip?

2 Answers
  • 2020-12-17 01:20

    HDFS has the limited scope of being only a distributed file-system service; it doesn't perform heavy-lifting operations such as compressing data. The actual work of compression is delegated to distributed execution frameworks like MapReduce, Spark, Tez, etc. So the compression of data/files is the concern of the execution framework, not of the file system.

    Additionally, the presence of container file formats like SequenceFile, Parquet, etc. negates the need for HDFS to compress the data blocks automatically as the question suggests: those formats compress their own contents while remaining splittable.
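
    For instance, a SequenceFile can compress its records in batches with an ordinary codec and still be split, because sync markers are written between the compressed blocks. Here is a minimal sketch of writing a block-compressed SequenceFile; the output path and the sample records are made up for illustration:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class SeqFileWriteDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Path path = new Path("/data/events.seq"); // hypothetical path
                CompressionCodec codec =
                        ReflectionUtils.newInstance(GzipCodec.class, conf);

                // BLOCK compression gzips batches of records, while the sync
                // markers written between blocks keep the file splittable.
                try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(path),
                        SequenceFile.Writer.keyClass(LongWritable.class),
                        SequenceFile.Writer.valueClass(Text.class),
                        SequenceFile.Writer.compression(
                                SequenceFile.CompressionType.BLOCK, codec))) {
                    writer.append(new LongWritable(1L), new Text("first record"));
                    writer.append(new LongWritable(2L), new Text("second record"));
                }
            }
        }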

    So, to summarize: for design-philosophy reasons, any compression of data must be done by the execution engine, not by the file-system service.
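
    To show where that responsibility lives in practice, here is a minimal sketch of asking a MapReduce job to compress its output; the job name is made up, and the mapper/reducer/path setup is omitted:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.compress.BZip2Codec;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class CompressedOutputSetup {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "compressed-output-demo"); // hypothetical name

                // Compression is requested through the execution framework's
                // API, not through HDFS: the tasks compress their output as
                // they write it into the file system.
                FileOutputFormat.setCompressOutput(job, true);
                FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

                // ... configure mapper, reducer, input/output paths, then
                // submit with job.waitForCompletion(true).
            }
        }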

  • 2020-12-17 01:29

    The simple reason is the design principle of "separation of concerns".

    If you did what you propose, then HDFS would have to know what the actual bits and bytes of the file mean, and it would have to be able to reason about them (i.e. extract, decompress, etc.). In general you don't want this kind of mixing of responsibilities in software.

    So the only part that has to understand what the bits mean is the application that must be able to read them, which is commonly written using the MapReduce part of Hadoop.

    As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ):

    Always remember that there are alternative approaches:

    • Decompress the original gzipped file, split it into pieces and recompress the pieces before offering them to Hadoop.
      For example: Splitting gzipped logfiles without storing the ungzipped splits on disk
    • Decompress the original gzipped file and compress it using a different, splittable codec, for example BZip2Codec, or don't compress at all (see the sketch after this list).
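
    To make the splittability distinction concrete, here is a minimal sketch that resolves a codec from a file name with Hadoop's CompressionCodecFactory and checks whether it is splittable; the file names are made up, and the instanceof test mirrors the one TextInputFormat applies when deciding whether a file can be split:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionCodecFactory;
        import org.apache.hadoop.io.compress.SplittableCompressionCodec;

        public class SplittabilityCheck {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                CompressionCodecFactory factory = new CompressionCodecFactory(conf);

                // Hypothetical file names; the codec is resolved from the suffix.
                for (String name : new String[] {"access.log.gz", "access.log.bz2"}) {
                    CompressionCodec codec = factory.getCodec(new Path(name));
                    if (codec == null) {
                        continue; // no codec registered for this suffix
                    }
                    // Gzip is one undivided stream, so GzipCodec is not
                    // splittable; BZip2Codec implements SplittableCompressionCodec.
                    boolean splittable = codec instanceof SplittableCompressionCodec;
                    System.out.printf("%s -> %s (splittable: %b)%n",
                            name, codec.getClass().getSimpleName(), splittable);
                }
            }
        }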

    HTH
