Question
I want to output gzip and lzo formats at the same time in one job.
I used MultipleOutputs and added two named outputs like this:
MultipleOutputs.addNamedOutput(job, "LzoOutput", GBKTextOutputFormat.class, Text.class, Text.class);
GBKTextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);
MultipleOutputs.addNamedOutput(job, "GzOutput", TextOutputFormat.class, Text.class, Text.class);
TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
(GBKTextOutputFormat is a class I wrote myself that extends FileOutputFormat.)
They are used in the reducer like this:
multipleOutputs.write("LzoOutput", NullWritable.get(), value, "/user/hadoop/lzo/"+key.toString());
multipleOutputs.write("GzOutput", NullWritable.get(), value, "/user/hadoop/gzip/"+key.toString());
The result: I get outputs in both paths, but they are both in gzip format.
Can someone help me? Thanks!
==========================================================================
More:
I just looked at the source code of setOutputCompressorClass in FileOutputFormat, which calls conf.setClass("mapred.output.compression.codec", codecClass, CompressionCodec.class);
It seems that mapred.output.compression.codec in the configuration is overwritten every time setOutputCompressorClass is called.
So the actual compression format is the one set last, and we cannot set two different compression formats in the same job? Or is there something else I have overlooked?
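The overwrite described above can be reproduced without Hadoop. The following is only a toy sketch: a plain HashMap stands in for the job Configuration, the codec names are plain strings, and CodecOverwriteDemo and its methods are hypothetical names made up for illustration. It assumes, as the quoted source line suggests, that setOutputCompressorClass does nothing more than write one job-wide key.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a Hadoop job Configuration: one flat key/value map per job.
class CodecOverwriteDemo {
    static final String CODEC_KEY = "mapred.output.compression.codec";

    // Assumption: mirrors the quoted conf.setClass(...) call, i.e. a single job-wide key.
    static void setOutputCompressorClass(Map<String, String> conf, String codecName) {
        conf.put(CODEC_KEY, codecName);
    }

    // Calls the setter twice, in the same order as the job setup in the question.
    static String effectiveCodec() {
        Map<String, String> conf = new HashMap<>();
        setOutputCompressorClass(conf, "LzoCodec");   // for "LzoOutput"
        setOutputCompressorClass(conf, "GzipCodec");  // for "GzOutput", replaces the value above
        return conf.get(CODEC_KEY);
    }

    public static void main(String[] args) {
        System.out.println(effectiveCodec()); // prints GzipCodec
    }
}
```

Because the key is not scoped to a named output, only the last value survives, which matches both outputs ending up as gzip.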
Answer 1:
So maybe as a workaround, try setting the correct compression codec directly in the configuration
context.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
just before your write call to each of the outputs. It does look like MultipleOutputs does not handle output format configuration parameters other than key class, value class and output path well, so we may have to write a bit of code to compensate for that oversight.
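The idea behind this workaround can be sketched with a toy model. Everything here is hypothetical: PerOutputCodecDemo is not Hadoop code, the shared map stands in for context.getConfiguration(), and the model assumes that the writer for each named output is created lazily on the first write and reads the codec key once at that moment. Whether MultipleOutputs behaves this way in a given Hadoop version would need to be checked against its source.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: writers are created on first write and capture the codec at that time.
class PerOutputCodecDemo {
    static final String CODEC_KEY = "mapred.output.compression.codec";

    // Shared job configuration (stand-in for context.getConfiguration()).
    final Map<String, String> conf = new HashMap<>();
    // Codec captured by each named output's writer when it was first created.
    final Map<String, String> writerCodec = new LinkedHashMap<>();

    // Hypothetical write: only the first call per named output reads the codec key.
    void write(String namedOutput, String value) {
        writerCodec.computeIfAbsent(namedOutput, name -> conf.get(CODEC_KEY));
        // A real writer would compress and emit `value` here.
    }

    // Switches the codec key just before the first write to each named output.
    static Map<String, String> run() {
        PerOutputCodecDemo out = new PerOutputCodecDemo();
        out.conf.put(CODEC_KEY, "LzoCodec");
        out.write("LzoOutput", "first record");
        out.conf.put(CODEC_KEY, "GzipCodec");
        out.write("GzOutput", "first record");
        out.write("LzoOutput", "second record"); // reuses the writer created earlier
        return out.writerCodec;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {LzoOutput=LzoCodec, GzOutput=GzipCodec}
    }
}
```

Under that assumption the switch only matters before the first write to each output; later changes to the key do not affect writers that already exist.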
Source: https://stackoverflow.com/questions/12953010/hadoop-how-to-output-different-format-types-in-the-same-job