Renaming Part Files in Hadoop Map Reduce

傲寒 2020-12-05 05:18

I have tried to use the MultipleOutputs class, following the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapred

2 Answers
  • 2020-12-05 05:47

    Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still in use, so it initializes and creates the part-r-xxxxx files that you are seeing.

    They are empty because you never call context.write; all your records go through MultipleOutputs. That doesn't prevent the files from being created during initialization.

    To get rid of them, you need to define your OutputFormat to say you are not expecting any output. You can do it this way:

    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
    job.setOutputFormatClass(NullOutputFormat.class);
    

    With that output format set, your part files should never be initialized at all, but you still get your output from MultipleOutputs.

    You could probably also use LazyOutputFormat, which ensures that output files are created only when there is data to write, so no empty files are initialized. You could do it this way:

    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    

    Note that your Reducer calls the overload MultipleOutputs.write(String namedOutput, K key, V value), which uses a default output path derived from your namedOutput, something like: {namedOutput}-(m|r)-{part-number}. If you want more control over your output filenames, use the overload MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which lets you generate filenames at runtime based on your keys/values.
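
    As a rough sketch of the second overload, a Reducer could look something like the following. The class name KeyedOutputReducer, the named output "text", the Text/IntWritable types, and the key + "/part" path scheme are all assumptions made for illustration; the named output would also have to be registered in the Driver with MultipleOutputs.addNamedOutput, and the keys are assumed to be safe to use as path components.

    ```java
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Hypothetical reducer: sums the values per key and writes each key's
    // total under its own base output path.
    public class KeyedOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Assumption: "text" was registered as a named output in the Driver.
            // The fourth argument is the base output path, relative to the job
            // output directory; the -(m|r)-{part-number} suffix is appended to it.
            mos.write("text", key, new IntWritable(sum), key.toString() + "/part");
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Close MultipleOutputs so the underlying record writers are flushed.
            mos.close();
        }
    }
    ```

    Under these assumptions, a key such as "apple" would be expected to end up in a file like apple/part-r-00000 inside the job output directory.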

  • 2020-12-05 05:52

    This is all you need to do in the Driver class to change the basename of the output file: job.getConfiguration().set("mapreduce.output.basename", "text"); This will result in your files being named like "text-r-00000".
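
    To make the naming pattern concrete, here is a small standalone sketch that mimics how the basename, the task type and the partition number combine into a file name. PartFileName and partFileName are hypothetical names for illustration only; this is not a Hadoop API and does not call Hadoop.

    ```java
    import java.util.Locale;

    public class PartFileName {

        // Hypothetical helper (not part of Hadoop): builds
        // "<basename>-<m|r>-<partition zero-padded to 5 digits>".
        static String partFileName(String basename, boolean isReduceTask, int partition) {
            String taskType = isReduceTask ? "r" : "m";
            return String.format(Locale.ROOT, "%s-%s-%05d", basename, taskType, partition);
        }

        public static void main(String[] args) {
            System.out.println(partFileName("part", true, 0));   // prints part-r-00000
            System.out.println(partFileName("text", true, 0));   // prints text-r-00000
            System.out.println(partFileName("text", false, 12)); // prints text-m-00012
        }
    }
    ```

    The first call corresponds to the default basename "part", the second to the basename "text" set above.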
