Question 1: Writing output files to different directories. You can do this using the following approaches:
1. Using MultipleOutputs class:
It's great that you are able to create multiple named output files using MultipleOutputs. As you know, each named output needs to be registered in your driver code:
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);
The API provides several overloaded write methods. The simplest form takes the named output, key, and value:
multipleOutputs.write("OutputFileName", new Text(Key), new Text(Value));
Now, to write the output files to separate output directories, use the overloaded write method that takes an extra parameter for the base output path:
multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);
Remember to use a different baseOutputPath for each category you write.
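To make this concrete, here is a minimal reducer sketch (the class name CategoryReducer and the baseOutputPath value "set1/part" are hypothetical placeholders). Note that baseOutputPath is resolved relative to the job's output directory, and a value containing "/" creates subdirectories:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // "set1/part" is a hypothetical baseOutputPath; the framework appends
            // the task suffix, producing e.g. <jobOutputDir>/set1/part-r-00000
            multipleOutputs.write("OutputFileName", key, value, "set1/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close(); // flush and close all named outputs
    }
}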
2. Rename/move the files in the driver class:
This is probably the easiest hack to write output to multiple directories. Use MultipleOutputs to write all the output files to a single output directory, but make sure the file names are different for each category.
Assume you want to create 3 different sets of output files; the first step is to register the named outputs in the driver:
MultipleOutputs.addNamedOutput(job, "set1", OutputFormatClass, keyClass, valueClass);
MultipleOutputs.addNamedOutput(job, "set2", OutputFormatClass, keyClass, valueClass);
MultipleOutputs.addNamedOutput(job, "set3", OutputFormatClass, keyClass, valueClass);
Also, declare the different output directories (or whatever directory structure you want) in the driver code, along with the actual output directory:
Path set1Path = new Path("/hdfsRoot/outputs/set1");
Path set2Path = new Path("/hdfsRoot/outputs/set2");
Path set3Path = new Path("/hdfsRoot/outputs/set3");
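One step worth making explicit: HDFS rename() will fail if the destination's parent directory does not exist, so create the target directories up front (using the fileSystem handle obtained in the snippet below):

fileSystem.mkdirs(set1Path); // behaves like mkdir -p; succeeds if the directory already exists
fileSystem.mkdirs(set2Path);
fileSystem.mkdirs(set3Path);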
The final important step is to move each output file to its target directory based on its name, once the job has completed successfully:
FileSystem fileSystem = FileSystem.get(new Configuration());
if (jobStatus == 0) {
    // Get the output files from the actual output path
    FileStatus outputfs[] = fileSystem.listStatus(outputPath);
    // Iterate over all the files in the output path
    for (int fileCounter = 0; fileCounter < outputfs.length; fileCounter++) {
        // Based on each file name, move the file to the matching directory
        // (here we keep the original file name; substitute any new name you like)
        String fileName = outputfs[fileCounter].getPath().getName();
        if (fileName.contains("set1")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set1Path, fileName));
        } else if (fileName.contains("set2")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set2Path, fileName));
        } else if (fileName.contains("set3")) {
            fileSystem.rename(outputfs[fileCounter].getPath(), new Path(set3Path, fileName));
        }
    }
}
Note: This will not add any significant overhead to the job, because an HDFS rename only moves file metadata rather than copying data. Choosing a particular approach depends on the nature of your implementation.
In summary, this approach writes all the output files under different names to the same output directory, and when the job completes successfully, we move them from the base output path into their respective output directories.
Question 2: Reading specific files from one or more input folders:
You can definitely read specific input files from a directory using the MultipleInputs class.
Based on your input path/file names, you can route each input file to the corresponding Mapper implementation.
Case 1: If all the input files ARE IN a single directory:
FileStatus inputfs[] = fileSystem.listStatus(inputPath);
for (int fileCounter = 0; fileCounter < inputfs.length; fileCounter++) {
    String fileName = inputfs[fileCounter].getPath().getName();
    if (fileName.contains("set1")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set1Mapper.class);
    } else if (fileName.contains("set2")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set2Mapper.class);
    } else if (fileName.contains("set3")) {
        MultipleInputs.addInputPath(job, inputfs[fileCounter].getPath(), TextInputFormat.class, Set3Mapper.class);
    }
}
Case 2: If all the input files ARE NOT IN a single directory:
We can use the same approach even if the input files live in different directories: iterate recursively over the base input path and check each file's name against the matching criteria (see the sketch after the next snippet).
Or, if the files are in completely different locations, the simplest way is to add each input path individually:
MultipleInputs.addInputPath(job, Set1_Path, TextInputFormat.class, Set1Mapper.class);
MultipleInputs.addInputPath(job, Set2_Path, TextInputFormat.class, Set2Mapper.class);
MultipleInputs.addInputPath(job, Set3_Path, TextInputFormat.class, Set3Mapper.class);
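For the first option, here is a minimal sketch of the recursive iteration, assuming a basePath variable plus the same fileSystem handle and mapper classes as above (FileSystem.listFiles with recursive=true is available in Hadoop 2.x):

RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(basePath, true); // true = recurse into subdirectories
while (files.hasNext()) {
    Path filePath = files.next().getPath();
    if (filePath.getName().contains("set1")) {
        MultipleInputs.addInputPath(job, filePath, TextInputFormat.class, Set1Mapper.class);
    } else if (filePath.getName().contains("set2")) {
        MultipleInputs.addInputPath(job, filePath, TextInputFormat.class, Set2Mapper.class);
    } else if (filePath.getName().contains("set3")) {
        MultipleInputs.addInputPath(job, filePath, TextInputFormat.class, Set3Mapper.class);
    }
}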
Hope this helps! Thank you.
Yes, you can specify that an input format only processes certain files:
FileInputFormat.setInputPaths(job, "/path/to/folder/testfile*");
If you do amend the code, remember that the _SUCCESS file should be written to both folders upon successful job completion. While this isn't a requirement, it is a mechanism by which someone can determine whether the output in that folder is complete, and not 'truncated' because of an error.
Copy the MultipleOutputs code into your code base and loosen the restriction on allowable characters in named output names. I can't see any valid reason for the restrictions anyway.
Yes, you can do this. All you need to do is generate the file name for a particular key/value pair coming out of the reducer.
If you extend MultipleTextOutputFormat (from the old mapred API) and override its generateFileNameForKeyValue() method, you can return a file name that depends on the key/value pair you get. Here is a link that shows you how to do that:
https://sites.google.com/site/hadoopandhive/home/how-to-write-output-to-multiple-named-files-in-hadoop-using-multipletextoutputformat
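For reference, here is a minimal sketch of that technique, assuming Text keys and values and the old org.apache.hadoop.mapred API (the class name KeyBasedOutput is a hypothetical placeholder):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Route each record to a file derived from its key;
        // "name" is the default leaf file name (e.g. part-00000)
        return key.toString() + "/" + name;
    }
}

Then set it on the JobConf in the driver: jobConf.setOutputFormat(KeyBasedOutput.class);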