Question
I'm running a map-reduce job and my inputs are gzipped, but they do not have a .gz file-name extension.
Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper; without the extension it doesn't. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them anyway.
I tried passing the following flags to Hadoop:
step_args = [
    "-jobconf", "stream.recordreader.compression=gzip",
    "-jobconf", "mapred.output.compress=true",
    "-jobconf", "mapred.output.compression.type=block",
    "-jobconf", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
]
However, the input to the mapper still arrives compressed. I verified this by printing the mapper's input from inside the mapper code:
mapper input: ^_^@%r?T^B??\K??6^R?+F?3^D??b?^R,??!???a?^X?A??n?m?k?3id?o?z[?-?L2yt^P$n?T,^V????^??y^O^R?nno>}^B^E^N-7?^Z?'?I?OF4??-^Z^X4;????f?RH???^Z?Q??4#^W?I?^F??^]?f+???f0d??A??v?A3*????7?x?p??7?Mq?.g??{^FL?g?^Y+?6??I????^V?C??I??$??ESCVd)K??}?Z??j?,3?{ ?}v???j???^??"?.??^L?^?LX^F??p???
Any advice on how to unzip on the fly would be greatly appreciated!
Thanks! Gil.
Answer 1:
You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses the file path for its extension. You can instead call getCodecByClassName to obtain any codec you want.
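For illustration, here is a minimal sketch of such a record reader against the newer mapreduce API. The class name ForcedGzipLineRecordReader is my own invention, and the sketch leaves out things the real LineRecordReader handles (uncompressed input, split boundaries, custom delimiters, real progress reporting):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Record reader that always decompresses its input as gzip,
// regardless of the file's extension.
public class ForcedGzipLineRecordReader extends RecordReader<LongWritable, Text> {

    private LineReader in;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private long pos = 0;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(file);

        // Instead of factory.getCodec(file), which matches on the extension,
        // ask the factory for the gzip codec by class name.
        CompressionCodec codec = new CompressionCodecFactory(conf)
                .getCodecByClassName(GzipCodec.class.getName());
        in = new LineReader(codec.createInputStream(fileIn), conf);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        key.set(pos);
        int bytesRead = in.readLine(value);
        if (bytesRead == 0) {
            return false;  // end of stream
        }
        pos += bytesRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() { return 0.0f; }  // progress reporting omitted

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}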
You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
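A matching input format (again, the name is hypothetical) then just hands back that reader, and marks files as non-splittable, since a gzip stream cannot be split across mappers:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Input format that plugs in the forced-gzip record reader above.
public class ForcedGzipTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ForcedGzipLineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // a gzip stream can't be split across mappers
    }
}

One caveat: since your step_args suggest you're going through Hadoop Streaming, note that Streaming's -inputformat option expects input formats written against the older org.apache.hadoop.mapred API, so you'd need to make the equivalent change to that API's LineRecordReader and TextInputFormat instead.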
Source: https://stackoverflow.com/questions/31968932/how-to-force-hadoop-to-unzip-inputs-regadless-of-their-extension