I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory stores the summarized results of each day's job run.
Hadoop's FileOutputFormat
(which TextOutputFormat, the default you are presumably using, extends) does not allow writing into an existing output directory: its checkOutputSpecs method fails the job up front if the directory is already there. Probably to spare you the pain of finding out you mistakenly deleted something you (and your cluster) worked very hard on.
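As an aside, if all you need is for the previous day's directory to be gone before each run (and you can afford to lose its contents), a much simpler option is to delete it from the driver before submitting the job. A minimal sketch, where /reports/daily stands in for your (hypothetical) output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// In the job driver, before submitting the job:
Configuration conf = new Configuration();
Path outputDir = new Path("/reports/daily"); // hypothetical output path
FileSystem fs = outputDir.getFileSystem(conf);
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true); // recursive = true removes the whole tree
}

The rest of this answer covers the case where you want the job itself to overwrite files in place.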
However, if you are certain you want the job to overwrite the output folder, I believe the cleanest way is to subclass TextOutputFormat
a little, like this:
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        // "mapreduce.output.textoutputformat.separator" on newer Hadoop releases
        String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
        CompressionCodec codec = null;
        String extension = "";
        if (isCompressed) {
            Class<? extends CompressionCodec> codecClass =
                    getOutputCompressorClass(job, GzipCodec.class);
            codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            extension = codec.getDefaultExtension();
        }
        Path file = getDefaultWorkFile(job, extension);
        FileSystem fs = file.getFileSystem(conf);
        // The second argument (overwrite = true) is the only change from the
        // stock TextOutputFormat implementation.
        FSDataOutputStream fileOut = fs.create(file, true);
        if (!isCompressed) {
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
            return new LineRecordWriter<K, V>(
                    new DataOutputStream(codec.createOutputStream(fileOut)),
                    keyValueSeparator);
        }
    }
}
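One caveat: overwriting at the RecordWriter level does not, by itself, get you past Hadoop's up-front check. FileOutputFormat.checkOutputSpecs throws a FileAlreadyExistsException before any task runs if the output directory already exists, so you will likely also need to relax that check. A minimal sketch of an override to add inside OverwriteTextOutputFormat (note that the stock implementation also obtains HDFS delegation tokens on secure clusters, which this simplification skips):

import org.apache.hadoop.mapreduce.JobContext;

// Skip the "output directory already exists" check that
// FileOutputFormat.checkOutputSpecs performs before the job starts.
@Override
public void checkOutputSpecs(JobContext job) throws IOException {
    Path outDir = getOutputPath(job);
    if (outDir == null) {
        throw new IOException("Output path not set.");
    }
    // Deliberately not calling super.checkOutputSpecs(job), which would
    // throw FileAlreadyExistsException when outDir exists.
}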
The only functional change is that the FSDataOutputStream is now created via fs.create(file, true), i.e. with overwrite=true, so an existing part file is replaced instead of triggering a failure.
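To use it, register the class as the job's output format in your driver (the job variable and path here are illustrative):

job.setOutputFormatClass(OverwriteTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/reports/daily")); // hypothetical path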