BZip2 file read in Hadoop

Submitted by 旧巷老猫 on 2019-12-21 21:39:21

Question


I heard that Hadoop can use multiple mappers to read different parts of a single bzip2 file in parallel to improve performance, but I could not find any related samples after searching. I would appreciate it if anyone could point me to a related code snippet. Thanks.

BTW, does gzip have the same feature (multiple mappers processing different parts of one gzip file in parallel)?


Answer 1:


If you look at http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/30662, you will find that the bzip2 format is indeed splittable and that multiple mappers can work on one file. The patch was submitted at https://issues.apache.org/jira/browse/HADOOP-4012. However, it seems to be available only from Hadoop 0.21.0 onward.

From personal experience, you do not need to do anything different to use this with bzip2. Hadoop should pick it up automatically, depending on your minimum split size.

bzip2 compresses data in blocks, so it is possible to decompress it block by block and send different blocks to separate mappers. gzip has no such block structure, so a gzip file cannot be divided among different mappers.
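To make the block structure concrete, here is a minimal Python sketch. It is not Hadoop's code, only an illustration of the idea: it scans a bzip2 stream bit by bit for the 48-bit block marker, then wraps each block in its own stream header and trailer so the standard `bz2` module can decompress it with no knowledge of the other blocks. The function names (`find_markers`, `rebuild_block`, `decompress_blocks`) are hypothetical. The sketch assumes a single bzip2 stream and ignores the very small chance that the marker bits occur by coincidence inside compressed data.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker that starts every compressed block
EOS_MAGIC = 0x177245385090    # 48-bit marker that starts the end-of-stream trailer
MASK48 = (1 << 48) - 1


def find_markers(data):
    """Return (bit offsets of all block markers, bit offset of the end marker)."""
    window, blocks, eos = 0, [], None
    for i, byte in enumerate(data):
        for shift in range(7, -1, -1):
            # Blocks are not byte-aligned, so slide a 48-bit window one bit at a time.
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            start = i * 8 + (7 - shift) - 47  # bit where the current window begins
            if window == BLOCK_MAGIC:
                blocks.append(start)
            elif window == EOS_MAGIC:
                eos = start
    return blocks, eos


def rebuild_block(data, start, end):
    """Wrap the bits [start, end) of one block into a standalone bzip2 stream."""
    total_bits = len(data) * 8
    length = end - start
    bits = (int.from_bytes(data, "big") >> (total_bits - end)) & ((1 << length) - 1)
    block_crc = (bits >> (length - 80)) & 0xFFFFFFFF  # the 32 bits after the marker
    stream = int.from_bytes(data[:4], "big")          # reuse the "BZh<level>" header
    stream = (stream << length) | bits
    stream = (stream << 48) | EOS_MAGIC
    stream = (stream << 32) | block_crc  # with one block, stream CRC == block CRC
    nbits = 32 + length + 80
    pad = -nbits % 8                     # zero-pad to a whole number of bytes
    return (stream << pad).to_bytes((nbits + pad) // 8, "big")


def decompress_blocks(data):
    """Decompress every block independently and return the pieces in order."""
    blocks, eos = find_markers(data)
    bounds = blocks + [eos]
    return [bz2.decompress(rebuild_block(data, bounds[i], bounds[i + 1]))
            for i in range(len(blocks))]


original = b"".join(b"record %d\n" % i for i in range(40000))
compressed = bz2.compress(original, compresslevel=1)  # level 1 uses 100 kB blocks
parts = decompress_blocks(compressed)
print(len(parts), "blocks;", "round trip ok:", b"".join(parts) == original)
```

Each call to `bz2.decompress` here works on one block only, so the calls could be distributed across workers, which is the property a splittable reader relies on.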




Answer 2:


You can look at pbzip2 for an example of parallel bz2 compression and decompression.

There is a parallel gzip as well, pigz. It does parallel compression, but not parallel decompression, because the deflate format is not suited to parallel decompression. However, you can either a) prepare a special gzip stream with resets of the history, or b) build an index into a gzip file on a first pass. Either way, you can then read different parts in parallel, or get more efficient random access.
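Option a) can be sketched in Python with the standard `zlib` module. This is an illustration under stated assumptions, not how pigz or Hadoop works: the writer calls `flush(Z_FULL_FLUSH)` after each chunk, which byte-aligns the output and resets the compression history, and it records the byte offset of each reset as a small index. The names `compress_with_resets` and `inflate_segment` and the sample chunks are hypothetical. `GZIP_HEADER_LEN = 10` assumes the minimal gzip header that zlib writes when no optional header fields are set.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

GZIP_HEADER_LEN = 10  # assumption: zlib's minimal gzip header, no optional fields


def compress_with_resets(chunks):
    """Gzip the chunks into one stream, resetting the history after each chunk."""
    comp = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    out, ends = bytearray(), []
    for chunk in chunks:
        out += comp.compress(chunk)
        out += comp.flush(zlib.Z_FULL_FLUSH)  # byte-align and forget earlier data
        ends.append(len(out))                 # offset of the reset point (the index)
    out += comp.flush()                       # final block plus gzip trailer
    return bytes(out), ends


def inflate_segment(segment):
    """Decode one segment as raw deflate, with no knowledge of the others."""
    return zlib.decompressobj(wbits=-15).decompress(segment)


chunks = [b"alpha " * 5000, b"beta " * 5000, b"gamma " * 5000]
stream, ends = compress_with_resets(chunks)
starts = [GZIP_HEADER_LEN] + ends[:-1]
segments = [stream[s:e] for s, e in zip(starts, ends)]

# Each segment is decoded by a separate worker.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(inflate_segment, segments))

print(parts == chunks)
```

The result is still an ordinary gzip stream that any gzip reader can decompress sequentially. The resets cost some compression ratio, because matches can no longer refer back across a reset point.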



Source: https://stackoverflow.com/questions/14035736/bzip2-file-read-in-hadoop
