How to overwrite/reuse the existing output path for Hadoop jobs again and again

既然无缘 2021-02-12 10:29

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. Actually, the output directory will store the summarized output of each day's job run.

10 Answers
  • 2021-02-12 11:00

    Jungblut's answer is your direct solution. Since I personally never trust automated processes to delete data, I'll suggest an alternative:

    Instead of trying to overwrite, make the output name of your job dynamic by including the time at which it ran.

    Something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep your old job output around in case you ever need to revisit it. In my system, which runs 10+ daily jobs, we structure the output as /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, etc.
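
    A minimal driver-side sketch of that naming scheme, assuming the /output/job1/.../job1out layout above (the class name is a placeholder and the rest of the job setup is omitted):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DatedOutputDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "daily-summary");
            // Build a per-run path such as /output/job1/2011/10/09/job1out
            String datedDir = new SimpleDateFormat("yyyy/MM/dd").format(new Date());
            FileOutputFormat.setOutputPath(job, new Path("/output/job1/" + datedDir + "/job1out"));
            // ... set mapper, reducer, input path and formats as usual, then submit ...
        }
    }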

  • 2021-02-12 11:03

    I had a similar use case; I used MultipleOutputs to resolve it.

    For example, if I want different MapReduce jobs to write to the same directory /outputDir/: job 1 writes to /outputDir/job1-part1.txt, job 2 writes to /outputDir/job1-part2.txt (without deleting existing files).

    In the driver (main), set the output directory to a throw-away one (it can be deleted before a new job runs):

    FileOutputFormat.setOutputPath(job, new Path("/randomPath"));
    

    In the reducer/mapper, use MultipleOutputs and set the writer to write to the desired directory:

    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);  // keep it in a field so map()/reduce() can write through it
    }
    

    and:

    mos.write(key, value, "/outputDir/fileOfJobX.txt");
    

    However, my use case was a bit more complicated than that. If you just need to write to the same flat directory, you can write to a different directory and run a script to migrate the files, like: hadoop fs -mv /tmp/* /outputDir

    In my use case, each MapReduce job writes to different sub-directories based on the value of the message being written. The directory structure can be multi-layered, like:

    /outputDir/
        messageTypeA/
            messageSubTypeA1/
                job1Output/
                    job1-part1.txt
                    job1-part2.txt
                    ...
                job2Output/
                    job2-part1.txt
                    ...
    
            messageSubTypeA2/
            ...
        messageTypeB/
        ...
    

    Each MapReduce job can write to thousands of sub-directories, and the cost of writing to a tmp dir and then moving each file to the correct directory is high.
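
    A self-contained sketch of that kind of routing in a reducer (the comma-separated message layout and the job1Output naming are assumptions for illustration; note it also closes the MultipleOutputs in cleanup(), which the snippets above leave out):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class RoutingReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // Hypothetical routing: assume each message carries "type,subType,payload"
                String[] parts = value.toString().split(",", 3);
                // e.g. "messageTypeA/messageSubTypeA1/job1Output/part"
                String basePath = parts[0] + "/" + parts[1] + "/job1Output/part";
                mos.write(key, value, basePath);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();  // flush and close all MultipleOutputs writers
        }
    }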

  • 2021-02-12 11:11

    What about deleting the directory before you run the job?

    You can do this via shell:

    hadoop fs -rmr /path/to/your/output/
    # on newer Hadoop releases: hadoop fs -rm -r /path/to/your/output/
    

    or via the Java API:

    // configuration should contain reference to your namenode
    FileSystem fs = FileSystem.get(new Configuration());
    // true stands for recursively deleting the folder you gave
    fs.delete(new Path("/path/to/your/output"), true);
    
  • 2021-02-12 11:12

    You need to add this setting in your main (driver) class:

    // Configure the output path, deleting it first if it already exists
    Path outputPath = new Path(args[1]);
    outputPath.getFileSystem(conf).delete(outputPath, true);  // true = delete recursively
    FileOutputFormat.setOutputPath(job, outputPath);
    
  • 2021-02-12 11:17

    Hadoop already supports the effect you seem to be trying to achieve by allowing multiple input paths to a job. Instead of trying to keep a single directory of files to which you add more files, keep a directory of directories to which you add new directories.

    To use the aggregate result as input, specify the input glob as a wildcard over the subdirectories (e.g. my-aggregate-output/*). To "append" new data to the aggregate as output, specify a new unique subdirectory of the aggregate as the output directory, generally using a timestamp or some sequence number derived from your input data (e.g. my-aggregate-output/20140415154424).
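
    A driver-side sketch of that layout, using the my-aggregate-output name from above (whether the producing and consuming steps live in one job or two depends on your pipeline; the class name is a placeholder):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AggregateDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "aggregate-run");
            // Consume the whole aggregate: a glob over all existing sub-directories
            FileInputFormat.addInputPath(job, new Path("/my-aggregate-output/*"));
            // "Append" to the aggregate: each run writes a new, unique sub-directory
            String stamp = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
            FileOutputFormat.setOutputPath(job, new Path("/my-aggregate-output/" + stamp));
            // ... set mapper, reducer and formats as usual, then submit ...
        }
    }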

  • 2021-02-12 11:19

    Hadoop's FileOutputFormat (and therefore TextOutputFormat, which I guess you are using) does not allow writing into an existing output directory, probably to spare you the pain of finding out that you mistakenly deleted something you (and your cluster) worked very hard on.

    However, if you are certain you want your output folder to be overwritten by the job, I believe the cleanest way is to change TextOutputFormat a little, like this:

    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.ReflectionUtils;

    public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

        @Override
        public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
                throws IOException, InterruptedException {
            Configuration conf = job.getConfiguration();
            boolean isCompressed = getCompressOutput(job);
            String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
            CompressionCodec codec = null;
            String extension = "";
            if (isCompressed) {
                Class<? extends CompressionCodec> codecClass =
                        getOutputCompressorClass(job, GzipCodec.class);
                codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
                extension = codec.getDefaultExtension();
            }
            Path file = getDefaultWorkFile(job, extension);
            FileSystem fs = file.getFileSystem(conf);
            // The only change from stock TextOutputFormat: create() with overwrite = true
            FSDataOutputStream fileOut = fs.create(file, true);
            if (!isCompressed) {
                return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
            } else {
                return new LineRecordWriter<K, V>(
                        new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
            }
        }
    }
    

    Now the FSDataOutputStream (fs.create(file, true)) is created with overwrite=true. One caveat: FileOutputFormat still refuses to start a job whose output directory already exists (the check happens in checkOutputSpecs at submission time), so to truly reuse a directory you may also need to relax that check or delete the directory first, as other answers suggest.
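
    To plug it in, point the job at the custom format in the driver, e.g.:

    job.setOutputFormatClass(OverwriteTextOutputFormat.class);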
