How to overwrite/reuse the existing output path for Hadoop jobs again and again

既然无缘 2021-02-12 10:29

I want to overwrite/reuse the existing output directory each time I run my daily Hadoop job. The output directory stores the summarized results of each day's job run.

10 answers
  •  一向 (OP)
    2021-02-12 11:19

    Hadoop's TextOutputFormat (which I guess you are using) does not allow overwriting an existing directory, probably to spare you the pain of finding out you mistakenly deleted something you (and your cluster) worked very hard on.

    However, if you are certain you want your output folder to be overwritten by the job, I believe the cleanest way is to subclass TextOutputFormat with one small change, like this:

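    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.ReflectionUtils;

    public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V>
    {
          @Override
          public RecordWriter<K, V>
          getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
          {
              Configuration conf = job.getConfiguration();
              boolean isCompressed = getCompressOutput(job);
              String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
              CompressionCodec codec = null;
              String extension = "";
              if (isCompressed)
              {
                  // Instantiate the configured codec (gzip by default) and use its file extension.
                  Class<? extends CompressionCodec> codecClass =
                          getOutputCompressorClass(job, GzipCodec.class);
                  codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
                  extension = codec.getDefaultExtension();
              }
              Path file = getDefaultWorkFile(job, extension);
              FileSystem fs = file.getFileSystem(conf);
              // overwrite=true: replace an existing file instead of failing.
              FSDataOutputStream fileOut = fs.create(file, true);
              if (!isCompressed)
              {
                  return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
              }
              else
              {
                  return new LineRecordWriter<K, V>(
                          new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
              }
          }
    }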
    

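    Compared with the stock TextOutputFormat, the only change is that the FSDataOutputStream is now created with overwrite=true (fs.create(file, true)), so an existing output file is replaced rather than causing an error.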
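    If you want to see what the overwrite flag does without a cluster, here is a rough local analogy using plain java.nio. It is not Hadoop code: the class OverwriteFlagDemo and its create helper are hypothetical names made up for this sketch, and the helper only mimics the shape of FileSystem.create(path, overwrite) on the local filesystem. CREATE_NEW plays the role of overwrite=false (fail if the file exists), while CREATE plus TRUNCATE_EXISTING plays the role of overwrite=true.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OverwriteFlagDemo {

    // Hypothetical helper: a local stand-in for a create(path, overwrite) call.
    public static void create(Path file, String content, boolean overwrite) throws IOException {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        if (overwrite) {
            // Replace whatever is already there.
            Files.write(file, bytes,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        } else {
            // Throws FileAlreadyExistsException if the file is already present.
            Files.write(file, bytes,
                    StandardOpenOption.CREATE_NEW,
                    StandardOpenOption.WRITE);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("overwrite-demo");
        Path file = dir.resolve("part-r-00000");

        create(file, "summary for day 1\n", false); // first run: file does not exist yet

        try {
            create(file, "summary for day 2\n", false); // second run without overwrite
        } catch (FileAlreadyExistsException e) {
            System.out.println("overwrite=false refused to replace " + file.getFileName());
        }

        create(file, "summary for day 2\n", true); // second run with overwrite
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

    Keep in mind the caveat from the top of this answer: once overwrite is on, yesterday's output is gone as soon as today's run writes the same file name.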
