My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.
The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class.
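Roughly, the old-style pattern looks like this (a sketch of the classic org.apache.hadoop.filecache.DistributedCache API, not the tutorial's exact code; MyJob and the file name are placeholders, and exception handling is omitted):

// Driver (old API), e.g. inside a method that throws Exception:
JobConf conf = new JobConf(MyJob.class);
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), conf);

// Mapper/Reducer (old API): fetch the node-local copies
Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);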
I had the same problem, and not only is DistributedCache deprecated, but so are getLocalCacheFiles and "new Job". What worked for me is the following:
Driver:
Configuration conf = getConf();
Job job = Job.getInstance(conf);
...
job.addCacheFile(new Path(filename).toUri());
In Mapper/Reducer setup:
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
super.setup(context);
URI[] files = context.getCacheFiles(); // the URIs added with addCacheFile()
Path file1path = new Path(files[0]);
...
}
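To actually read the file at that point, something like this should work (my sketch, assuming a small plain-text file; since getCacheFiles() returns the URIs you added, the file can be opened through the FileSystem API, with the usual java.io and org.apache.hadoop.fs imports):

FileSystem fs = FileSystem.get(context.getConfiguration());
try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file1path)))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // parse one line of configuration data
    }
}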
The APIs for the Distributed Cache can be found in the Job class itself; check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
The code should be something like this:
Job job = new Job();
...
job.addCacheFile(new Path(filename).toUri());
In your mapper code:
Path[] localPaths = context.getLocalCacheFiles();
...
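The Path objects returned by getLocalCacheFiles() point at node-local copies, so (assuming the cached file is plain text) they can be read with ordinary java.io, for example:

try (BufferedReader reader = new BufferedReader(new FileReader(localPaths[0].toString()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // use the configuration line
    }
}

Keep in mind that, as noted in other answers, getLocalCacheFiles() is deprecated in newer releases.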
None of the solutions mentioned worked for me completely, possibly because the Hadoop version keeps changing; I am using Hadoop 2.6.4. Essentially, DistributedCache is deprecated, so I didn't want to use it. Some of the posts suggest using addCacheFile(); however, it has changed a bit. Here is how it worked for me:
job.addCacheFile(new URI("hdfs://X.X.X.X:9000/EnglishStop.txt#EnglishStop.txt"));
Here X.X.X.X can be the master IP address or localhost. EnglishStop.txt was stored in HDFS at the root (/) directory.
hadoop fs -ls /
The output is
-rw-r--r-- 3 centos supergroup 1833 2016-03-12 20:24 /EnglishStop.txt
drwxr-xr-x - centos supergroup 0 2016-03-12 19:46 /test
Funny but convenient: the #EnglishStop.txt fragment means we can now access the file as "EnglishStop.txt" in the mapper. Here is the code for that:
public void setup(Context context) throws IOException, InterruptedException
{
    // "EnglishStop.txt" is the local symlink created from the # fragment
    File stopwordFile = new File("EnglishStop.txt");
    FileInputStream fis = new FileInputStream(stopwordFile);
    BufferedReader reader = new BufferedReader(new InputStreamReader(fis));
    String stopWord;
    while ((stopWord = reader.readLine()) != null) {
        // stopWord is a word read from the cached file
    }
    reader.close();
}
This just worked for me. This way you can read lines from a file stored in HDFS.
I did not use job.addCacheFile(). Instead I used the -files option, e.g. "-files /path/to/myfile.txt#myfile", as before. Then in the mapper or reducer code I use the method below:
/**
 * Resolves a file distributed via -files (or addCacheFile) by its symlink name.
 * This method can be used with local execution or HDFS execution.
 *
 * @param context the task or job context
 * @param symLink the symlink name given after the # sign
 * @param throwExceptionIfNotFound whether to throw if the file is not in the cache
 * @return the resolved File, or null if not found and throwExceptionIfNotFound is false
 * @throws IOException if the FileSystem cannot be obtained
 */
public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
{
URI[] uris = context.getCacheFiles();
if(uris==null||uris.length==0)
{
if(throwExceptionIfNotFound)
throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
return null;
}
URI symlinkUri = null;
for(URI uri: uris)
{
if(symLink.equals(uri.getFragment()))
{
symlinkUri = uri;
break;
}
}
if(symlinkUri==null)
{
if(throwExceptionIfNotFound)
throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
return null;
}
//if we run this locally the file system URI scheme will be "file" otherwise it should be a symlink
return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())?(new File(symlinkUri.getPath())):new File(symLink);
}
Then in mapper/reducer:
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
super.setup(context);
File file = HadoopUtils.findDistributedFileBySymlink(context,"myfile",true);
... do work ...
}
Note that if I used "-files /path/to/myfile.txt" directly, then I would need to use "myfile.txt" to access the file, since that is the default symlink name.
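Note also that -files is only parsed when the driver runs through GenericOptionsParser, which is what ToolRunner does for you. A minimal driver sketch (class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MyJob");
        // set mapper/reducer, input/output formats and paths here
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner applies GenericOptionsParser, which handles -files, -libjars, -archives
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}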
To expand on @jtravaglini, the preferred way of using DistributedCache
for YARN/MapReduce 2 is as follows:
In your driver, use Job.addCacheFile():
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf, "MyJob");
job.setMapperClass(MyMapper.class);
// ...
// Mind the # sign after the absolute file location.
// You will be using the name after the # sign as your
// file name in your Mapper/Reducer
job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));
return job.waitForCompletion(true) ? 0 : 1;
}
And in your Mapper/Reducer, override the setup(Context context)
method:
@Override
protected void setup(
Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException {
if (context.getCacheFiles() != null
&& context.getCacheFiles().length > 0) {
File some_file = new File("./some");
File other_file = new File("./other");
// Do things to these two files, like read them
// or parse as JSON or whatever.
}
super.setup(context);
}
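If all you need is to load one of those files into memory, something like this inside that if block would do (a sketch, assuming the cached file is plain text; "./some" is the symlinked name from the driver above):

try (BufferedReader reader = new BufferedReader(new FileReader("./some"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // collect or parse each line of the cached file
    }
}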
The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class: see Job.addCacheFile().
Unfortunately, there aren't as of yet many comprehensive tutorial-style examples of this.
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29