I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory will store the summarized output of each day's job run.
You can create a time-stamped output subdirectory for each execution. For example, let's say you take the output directory from the user and set it as follows:
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Change this to the following lines:
// Append a per-run timestamp so each execution writes to a fresh subdirectory
String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Date());
FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
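This assumes the usual imports (java.text.SimpleDateFormat, java.util.Date, java.util.Locale). Each run then writes to its own subdirectory under args[1], so nothing is ever overwritten; the trade-off is that old runs accumulate and must be cleaned up separately.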
I encountered this exact problem; it stems from the exception raised in checkOutputSpecs in the class FileOutputFormat. In my case, I wanted to have many jobs adding files to directories that already exist, and I guaranteed that the files would have unique names.
I solved it by creating an output format class which overrides only the checkOutputSpecs method and swallows (ignores) the FileAlreadyExistsException that's thrown where it checks whether the directory already exists.
import java.io.IOException;

import org.apache.hadoop.mapred.FileAlreadyExistsException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        try {
            super.checkOutputSpecs(job);
        } catch (FileAlreadyExistsException ignored) {
            // Swallow the exception: an existing output directory is acceptable here
        }
    }
}
And then in the job configuration, I used LazyOutputFormat and also MultipleOutputs.
LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
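For reference, a rough sketch of the MultipleOutputs side (the named output "summary", the Text/LongWritable types, and the run-specific path below are placeholders, not my original code):
// Driver, after the LazyOutputFormat line above: declare a named output
// backed by the overwriting format.
MultipleOutputs.addNamedOutput(job, "summary", OverwriteTextOutputFormat.class, Text.class, LongWritable.class);
// Reducer: write through MultipleOutputs, choosing a baseOutputPath that is
// unique per run so files in the shared directory never collide
// (runId below is a placeholder).
// setup():   mos = new MultipleOutputs<Text, LongWritable>(context);
// reduce():  mos.write("summary", key, value, "summary/" + runId);
// cleanup(): mos.close();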
Hadoop follows the philosophy of Write Once, Read Many times. Thus when you try to write to the directory again, it assumes it has to make a new one (write once), but it already exists, and so it complains. You can delete it via hadoop fs -rm -r /path/to/your/output/ (the older hadoop fs -rmr form is deprecated). It's better to create a dynamic directory (e.g., based on a timestamp or hash value) in order to preserve data.
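If you prefer not to delete by hand before every run, here is a minimal sketch of the same cleanup done programmatically in the job driver, assuming conf is your job's Configuration and args[1] is the output path (uses org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path):
// Remove the previous run's output directory, if any, before submitting the job.
Path outputPath = new Path(args[1]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = recursive delete
}
FileOutputFormat.setOutputPath(job, outputPath);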
If one is loading the input file (with, e.g., appended entries) from the local file system to the Hadoop Distributed File System like so:
hdfs dfs -put /mylocalfile /user/cloudera/purchase
Then one could also overwrite/reuse the existing file on HDFS with -f. There is no need to delete or re-create the folder:
hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
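Note that -f overwrites the destination file during the upload itself; it does not change the check FileOutputFormat performs on a job's output directory, so this approach is for refreshing input data rather than for the job output path.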