I have tried to use the MultipleOutputs class as per the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapred
Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still in use, so it initializes and creates the part-r-xxxxx files that you are seeing.
They are empty because you never call context.write; all of your output goes through MultipleOutputs. But that doesn't prevent them from being created during initialization.
To get rid of them, you need to set an OutputFormat that says you are not expecting any output. You can do it this way:
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
job.setOutputFormatClass(NullOutputFormat.class);
With that set, your part files should never be initialized at all, but you still get your output through MultipleOutputs.
You could also use LazyOutputFormat, which ensures that output files are created only when there is data to write, so no empty files are initialized. You could do it this way:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
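To make the difference between eager and lazy file creation concrete, here is a small plain-Java sketch of the idea. It is not Hadoop's LazyOutputFormat code and uses no Hadoop classes; the names LazyWriterSketch and LazyWriter are made up for illustration. It only shows why deferring the open until the first record avoids empty files.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java illustration only; this is NOT Hadoop's LazyOutputFormat implementation.
class LazyWriterSketch {

    /** Opens (and therefore creates) the file only when the first record arrives. */
    static final class LazyWriter implements AutoCloseable {
        private final Path target;
        private BufferedWriter delegate; // stays null until the first write

        LazyWriter(Path target) {
            this.target = target;
        }

        void write(String record) throws IOException {
            if (delegate == null) {
                delegate = Files.newBufferedWriter(target); // the file is created here
            }
            delegate.write(record);
            delegate.newLine();
        }

        @Override
        public void close() throws IOException {
            if (delegate != null) {
                delegate.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-sketch");

        // Eager behaviour: opening the writer creates an empty file immediately.
        Path eager = dir.resolve("part-r-00000");
        Files.newBufferedWriter(eager).close();

        // Lazy behaviour: no record is ever written, so no file appears.
        Path lazy = dir.resolve("part-r-00001");
        try (LazyWriter w = new LazyWriter(lazy)) {
            // intentionally empty: this "task" produced no default output
        }

        System.out.println("eager file exists: " + Files.exists(eager));
        System.out.println("lazy file exists: " + Files.exists(lazy));
    }
}
```

The eager path mirrors what you are seeing with the default OutputFormat: the file is opened up front whether or not anything is written to it.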
Note that in your Reducer you are using the prototype MultipleOutputs.write(String namedOutput, K key, V value), which writes to a default output path generated from your namedOutput, something like {namedOutput}-(m|r)-{part-number}. If you want more control over your output filenames, use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which lets you generate filenames at runtime based on your keys/values.
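As a rough illustration of the naming pattern described above, the following plain-Java sketch builds names of the form {base}-(m|r)-{part-number}. It is not Hadoop source code; the five-digit zero padding is taken from the part-r-xxxxx names mentioned earlier, and basePathForKey is a hypothetical helper showing one way you might derive a baseOutputPath from a key.

```java
import java.util.Locale;

// Plain-Java illustration of the naming pattern; not Hadoop source code.
class OutputNameSketch {

    /** Builds "{base}-(m|r)-{part-number}" with a zero-padded, 5-digit part number. */
    static String fileName(String base, boolean reduceTask, int partition) {
        String taskType = reduceTask ? "r" : "m";
        return base + "-" + taskType + "-" + String.format(Locale.ROOT, "%05d", partition);
    }

    /** Hypothetical helper: derives a baseOutputPath from a record key. */
    static String basePathForKey(String key) {
        // Keep only characters that are safe in a file name; this policy is an assumption.
        return key.replaceAll("[^A-Za-z0-9_-]", "_");
    }

    public static void main(String[] args) {
        // write(namedOutput, key, value): the base of the name is the named output.
        System.out.println(fileName("text", true, 0)); // prints text-r-00000

        // write(namedOutput, key, value, baseOutputPath): the base is chosen at runtime.
        System.out.println(fileName(basePathForKey("errors/2013"), true, 3)); // prints errors_2013-r-00003
    }
}
```

In a real Reducer you would pass the derived string as the baseOutputPath argument of the four-argument write call, and Hadoop appends the task type and part number for you.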
This is all you need to do in the Driver class to change the basename of the output file:
job.getConfiguration().set("mapreduce.output.basename", "text");
So this will result in your files being called "text-r-00000".