How to overwrite/reuse the existing output path for Hadoop jobs again and again

既然无缘 2021-02-12 10:29

I want to overwrite/reuse the existing output directory each time I run my daily Hadoop job. The output directory stores the summarized results of each day's run.

10 Answers
  • 2021-02-12 11:19

    You can create an output subdirectory for each execution by time. For example, let's say you take the output directory from the user and set it as follows:

    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    

    Replace that with the following lines:

    // Needs imports: java.text.SimpleDateFormat, java.util.Locale, java.sql.Timestamp
    String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
    FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
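
    With this change, each daily run writes to its own time-stamped subdirectory (e.g. args[1]/2021.02.12.10.29.00), so a new run never collides with a previous day's output.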
    
  • 2021-02-12 11:19

    I encountered this exact problem; it stems from the exception raised in checkOutputSpecs in the class FileOutputFormat. In my case, I wanted many jobs to add files to directories that already exist, and I guaranteed that the files would have unique names.

    I solved it by creating an output format class that overrides only the checkOutputSpecs method and swallows (ignores) the FileAlreadyExistsException thrown when the directory already exists.

    import java.io.IOException;

    import org.apache.hadoop.mapred.FileAlreadyExistsException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
        @Override
        public void checkOutputSpecs(JobContext job) throws IOException {
            try {
                super.checkOutputSpecs(job);
            } catch (FileAlreadyExistsException ignored) {
                // Swallow the exception so an existing output directory is accepted.
            }
        }
    }
    

    Then, in the job configuration, I used LazyOutputFormat together with MultipleOutputs:

    LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
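
    For context, a minimal sketch of how that wiring might look in a driver (the job name, the named output "text", and the Text/NullWritable key/value classes are illustrative assumptions, not part of the original answer):

    // Sketch only: combine LazyOutputFormat with a named output via MultipleOutputs.
    Job job = Job.getInstance(conf, "daily-summary");   // assumed Configuration `conf`
    LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
    MultipleOutputs.addNamedOutput(job, "text", OverwriteTextOutputFormat.class,
            Text.class, NullWritable.class);            // assumed key/value types
    FileOutputFormat.setOutputPath(job, new Path(args[1]));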
    
  • 2021-02-12 11:20

    Hadoop follows the philosophy "write once, read many times". When you try to write to the same directory again, it assumes it has to create a new one (write once), but the directory already exists, so it complains. You can delete it with hadoop fs -rm -r /path/to/your/output/ (the older hadoop fs -rmr form is deprecated). It is better to create a dynamic directory name (e.g., based on a timestamp or a hash value) in order to preserve old data.
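
    If the previous day's output really should be replaced, a minimal driver-side sketch (assuming the standard org.apache.hadoop.fs.FileSystem API, a Configuration named conf, and the output path in args[1]) deletes the directory before the job starts:

    // Sketch: remove the last run's output so checkOutputSpecs passes.
    Path outputPath = new Path(args[1]);
    FileSystem fs = FileSystem.get(conf);   // `conf` is the job's Configuration (assumed)
    if (fs.exists(outputPath)) {
        fs.delete(outputPath, true);        // true = delete recursively
    }
    FileOutputFormat.setOutputPath(job, outputPath);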

  • 2021-02-12 11:22

    If you are loading the input file (with, e.g., appended entries) from the local file system into the Hadoop Distributed File System like this:

    hdfs dfs -put /mylocalfile /user/cloudera/purchase
    

    then you can also overwrite/reuse the existing HDFS destination with the -f flag; there is no need to delete or re-create anything first:

    hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
    