How to overwrite/reuse the existing output path for Hadoop jobs again and again

既然无缘 2021-02-12 10:29

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory will store the summarized output of each day's job run.
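
In driver terms, the daily run looks roughly like this (path and job name illustrative):

    Job job = Job.getInstance(new Configuration(), "daily-summary");
    FileOutputFormat.setOutputPath(job, new Path("/dailySummary"));
    // On the second daily run, submission fails with FileAlreadyExistsException:
    // FileOutputFormat.checkOutputSpecs() rejects an existing output directory.
    job.waitForCompletion(true);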

10 Answers
  •  旧巷少年郎
    2021-02-12 11:03

    I had a similar use case, and I used MultipleOutputs to solve it.

    For example, suppose different MapReduce jobs need to write to the same directory /outputDir/: job 1 writes to /outputDir/job1-part1.txt and job 2 writes to /outputDir/job2-part1.txt, without deleting each other's existing files.

    In the driver (main), set the job's output directory to a throw-away path (it can be deleted before each new run):

    FileOutputFormat.setOutputPath(job, new Path("/randomPath"));
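
    Since that path is disposable, the daily re-run can simply delete it in the driver before submitting the job; a minimal sketch, assuming job is the org.apache.hadoop.mapreduce.Job being configured:

    FileSystem fs = FileSystem.get(job.getConfiguration());
    Path tmpOut = new Path("/randomPath");
    if (fs.exists(tmpOut)) {
        fs.delete(tmpOut, true); // recursively remove the previous run's output
    }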
    

    In the reducer/mapper, create a MultipleOutputs instance and use it to write to the desired directory:

    private MultipleOutputs mos; // keep it as a field so map()/reduce() can use it

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }
    

    and:

    mos.write(key, value, "/outputDir/fileOfJobX.txt");
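
    One easy-to-miss detail: MultipleOutputs must be closed when the task finishes, or the extra outputs may never be flushed:

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flushes and closes every writer MultipleOutputs opened
    }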
    

    However, my use case was a bit more complicated than that. If the goal is just to write to the same flat directory, you can write to a different directory and then run a script to migrate the files, e.g.: hadoop fs -mv /tmp/* /outputDir
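
    The same migration can also be done from the driver once the job completes; a sketch assuming a flat /tmp output directory and that conf is the job's Configuration:

    FileSystem fs = FileSystem.get(conf);
    for (FileStatus f : fs.listStatus(new Path("/tmp"))) {
        // move each produced file into the shared output directory
        fs.rename(f.getPath(), new Path("/outputDir", f.getPath().getName()));
    }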

    In my use case, each MapReduce job writes to different sub-directories based on the value of the message being written. The directory structure can be multi-layered, for example:

    /outputDir/
        messageTypeA/
            messageSubTypeA1/
                job1Output/
                    job1-part1.txt
                    job1-part2.txt
                    ...
                job2Output/
                    job2-part1.txt
                    ...
    
            messageSubTypeA2/
            ...
        messageTypeB/
        ...
    

    Each MapReduce job can write to thousands of sub-directories, so the cost of writing to a tmp dir and then moving each file to its correct directory would be high.
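
    To make the routing concrete, a reducer can derive the base output path from the record itself; a minimal sketch (getType() and getSubType() are hypothetical accessors on the value):

    // Build the sub-directory path from the message; MultipleOutputs
    // creates any missing directories on the fly.
    String basePath = "/outputDir/" + value.getType() + "/"
            + value.getSubType() + "/job1Output/part";
    mos.write(key, value, basePath);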
