How to overwrite/reuse the existing output path for Hadoop jobs again and again

Asked by 既然无缘 on 2021-02-12 10:29

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory will store the summarized results of each day's run.

10 Answers
  • 2021-02-12 11:19

    You can create a separate output subdirectory for each run, keyed by timestamp. For example, let's say you take the output directory from the user and set it as follows:

    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    

    Change it to the following lines:

    // Produces e.g. 2021.02.12.10.29.00; needs java.text.SimpleDateFormat, java.util.Locale, java.sql.Timestamp
    String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
    FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
    
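    For reference, a minimal driver sketch (the class name, mapper/reducer wiring, and argument positions are illustrative assumptions, not from the original answer; new Date() is used in place of new Timestamp(...) since SimpleDateFormat accepts either):

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DailySummaryDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "daily-summary");
            job.setJarByClass(DailySummaryDriver.class);
            // ... mapper/reducer and key/value class setup elided ...
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Each run writes to its own subdirectory, e.g. <output>/2021.02.12.10.29.00
            String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Date());
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
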
  • 2021-02-12 11:19

    I ran into this exact problem; it stems from the exception raised by checkOutputSpecs in the class FileOutputFormat. In my case, I wanted many jobs to add files to directories that already exist, and I guaranteed that the files would have unique names.

    I solved it by creating an output format class that overrides only the checkOutputSpecs method and swallows (ignores) the FileAlreadyExistsException thrown where it checks whether the directory already exists.

    import java.io.IOException;
    import org.apache.hadoop.mapred.FileAlreadyExistsException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
        @Override
        public void checkOutputSpecs(JobContext job) throws IOException {
            try {
                super.checkOutputSpecs(job);
            } catch (FileAlreadyExistsException ignored) {
                // Swallow the exception: the output directory may already exist.
            }
        }
    }
    

    Then, in the job configuration, I used LazyOutputFormat together with MultipleOutputs:

    LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
    
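    As a rough sketch of how MultipleOutputs can fit alongside this (the named output "summary", the key/value classes, and the per-job base path are illustrative assumptions, not from the original answer):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Driver: register a named output backed by the overwriting format.
    LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
    MultipleOutputs.addNamedOutput(job, "summary",
            OverwriteTextOutputFormat.class, Text.class, IntWritable.class);

    // Reducer: create MultipleOutputs in setup(), write records with the job ID
    // as a unique base path so files from different runs never collide,
    // and close it in cleanup().
    MultipleOutputs<Text, IntWritable> mos = new MultipleOutputs<>(context);
    mos.write("summary", key, value, "run-" + context.getJobID() + "/part");
    mos.close();
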
  • 2021-02-12 11:20

    Hadoop follows a write-once, read-many philosophy. So when you try to write to the directory again, it assumes it has to make a new one (write once), finds that it already exists, and complains. You can delete it via hadoop fs -rm -r /path/to/your/output/ (the older -rmr form is deprecated). It is better to create a dynamic directory name (e.g., based on a timestamp or hash value) in order to preserve each run's data.
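
    If you do want to reuse the exact same path, a common alternative is to delete it programmatically from the driver before submitting the job. A minimal sketch (the helper name clearOutputDir is hypothetical); note that this discards the previous run's output, which is exactly why a dynamic directory is safer:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: recursively deletes the output directory if it exists.
    public static void clearOutputDir(Configuration conf, Path outputPath) throws IOException {
        FileSystem fs = outputPath.getFileSystem(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true); // true = delete recursively
        }
    }

    // In the driver, before FileOutputFormat.setOutputPath(job, outputPath):
    clearOutputDir(job.getConfiguration(), new Path(args[1]));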

  • 2021-02-12 11:22

    If you are loading the input file (e.g., one with appended entries) from the local file system into the Hadoop distributed file system like this:

    hdfs dfs -put /mylocalfile /user/cloudera/purchase
    

    then you can overwrite/reuse the existing target with -f; there is no need to delete or re-create the directory:

    hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
    