Question
I'm running a map-reduce job and my inputs are gzipped, but they do not have a .gz file-name extension.
Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper; without the extension it doesn't. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them anyway.
I tried passing the following flags to Hadoop:
step_args = [
    "-jobconf", "stream.recordreader.compression=gzip",
    "-jobconf", "mapred.output.compress=true",
    "-jobconf", "mapred.output.compression.type=block",
    "-jobconf", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
]
However, the input to the mapper still arrives compressed. I verified this by printing the mapper's input from inside the mapper code:
mapper input: ^_^@%r?T^B??\K??6^R?+F?3^D??b?^R,??!???a?^X?A??n?m?k?3id?o?z[?-?L2yt^P$n?T,^V????^??y^O^R?nno>}^B^E^N-7?^Z?'?I?OF4??-^Z^X4;????f?RH???^Z?Q??4#^W?I?^F??^]?f+???f0d??A??v?A3*????7?x?p??7?Mq?.g??{^FL?g?^Y+?6??I????^V?C??I??$??ESCVd)K??}?Z??j?,3?{ ?}v???j???^??"?.??^L?^?LX^F??p???
Any advice on how to unzip on the fly would be greatly appreciated!
Thanks! Gil.
Answer 1:
You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses the file path for its extension. You can instead call getCodecByClassName to obtain any codec you want.
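For illustration, here is a minimal sketch of such a record reader against the newer mapreduce API. The class name ForcedGzipLineRecordReader is my own invention, and the sketch leaves out things the real LineRecordReader handles (uncompressed input, split boundaries, custom delimiters, real progress reporting):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Record reader that always decompresses its input as gzip,
// regardless of the file's extension.
public class ForcedGzipLineRecordReader extends RecordReader<LongWritable, Text> {

    private LineReader in;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private long pos = 0;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(file);

        // Instead of factory.getCodec(file), which matches on the extension,
        // ask the factory for the gzip codec by class name.
        CompressionCodec codec = new CompressionCodecFactory(conf)
                .getCodecByClassName(GzipCodec.class.getName());
        in = new LineReader(codec.createInputStream(fileIn), conf);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        key.set(pos);
        int bytesRead = in.readLine(value);
        if (bytesRead == 0) {
            return false;  // end of stream
        }
        pos += bytesRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() { return 0.0f; }  // progress reporting omitted

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}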
You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
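A matching input format (again, the name is hypothetical) then just hands back that reader, and marks files as non-splittable, since a gzip stream cannot be split across mappers:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Input format that plugs in the forced-gzip record reader above.
public class ForcedGzipTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ForcedGzipLineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // a gzip stream can't be split across mappers
    }
}

One caveat: since your step_args suggest you're going through Hadoop Streaming, note that Streaming's -inputformat option expects input formats written against the older org.apache.hadoop.mapred API, so you'd need to make the equivalent change to that API's LineRecordReader and TextInputFormat instead.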
Source: https://stackoverflow.com/questions/31968932/how-to-force-hadoop-to-unzip-inputs-regadless-of-their-extension