How to overwrite/reuse the existing output path for Hadoop jobs again and again

既然无缘 2021-02-12 10:29

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory will store the summarized output of each day's job run.
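
In driver terms, the daily run looks roughly like this (path and job name illustrative):

    Job job = Job.getInstance(new Configuration(), "daily-summary");
    FileOutputFormat.setOutputPath(job, new Path("/dailySummary"));
    // On the second daily run, submission fails with FileAlreadyExistsException:
    // FileOutputFormat.checkOutputSpecs() rejects an existing output directory.
    job.waitForCompletion(true);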

10 Answers
  •  旧巷少年郎
    2021-02-12 11:03

    I had a similar use case, and I used MultipleOutputs to solve it.

    For example, suppose different MapReduce jobs need to write to the same directory /outputDir/: job 1 writes to /outputDir/job1-part1.txt and job 2 writes to /outputDir/job2-part1.txt, without deleting each other's existing files.

    In the driver (main), set the job's output directory to a throw-away path (it can be deleted before each new run):

    FileOutputFormat.setOutputPath(job, new Path("/randomPath"));
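
    Since that path is disposable, the daily re-run can simply delete it in the driver before submitting the job; a minimal sketch, assuming job is the org.apache.hadoop.mapreduce.Job being configured:

    FileSystem fs = FileSystem.get(job.getConfiguration());
    Path tmpOut = new Path("/randomPath");
    if (fs.exists(tmpOut)) {
        fs.delete(tmpOut, true); // recursively remove the previous run's output
    }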
    

    In the reducer/mapper, create a MultipleOutputs instance and use it to write to the desired directory:

    private MultipleOutputs mos; // keep it as a field so map()/reduce() can use it

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }
    

    and:

    mos.write(key, value, "/outputDir/fileOfJobX.txt");
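
    One easy-to-miss detail: MultipleOutputs must be closed when the task finishes, or the extra outputs may never be flushed:

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flushes and closes every writer MultipleOutputs opened
    }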
    

    However, my use case was a bit more complicated than that. If the goal is just to write to the same flat directory, you can write to a different directory and then run a script to migrate the files, e.g.: hadoop fs -mv /tmp/* /outputDir
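
    The same migration can also be done from the driver once the job completes; a sketch assuming a flat /tmp output directory and that conf is the job's Configuration:

    FileSystem fs = FileSystem.get(conf);
    for (FileStatus f : fs.listStatus(new Path("/tmp"))) {
        // move each produced file into the shared output directory
        fs.rename(f.getPath(), new Path("/outputDir", f.getPath().getName()));
    }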

    In my use case, each MapReduce job writes to different sub-directories based on the value of the message being written. The directory structure can be multi-layered, for example:

    /outputDir/
        messageTypeA/
            messageSubTypeA1/
                job1Output/
                    job1-part1.txt
                    job1-part2.txt
                    ...
                job2Output/
                    job2-part1.txt
                    ...
    
            messageSubTypeA2/
            ...
        messageTypeB/
        ...
    

    Each MapReduce job can write to thousands of sub-directories, so the cost of writing to a tmp dir and then moving each file to its correct directory would be high.
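
    To make the routing concrete, a reducer can derive the base output path from the record itself; a minimal sketch (getType() and getSubType() are hypothetical accessors on the value):

    // Build the sub-directory path from the message; MultipleOutputs
    // creates any missing directories on the fly.
    String basePath = "/outputDir/" + value.getType() + "/"
            + value.getSubType() + "/job1Output/part";
    mos.write(key, value, basePath);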
