I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cac
The problem was that I was doing the following:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on the Hadoop user list for the help.
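The copy behaviour described above can be reproduced without Hadoop. The following is only a sketch: MiniConf and MiniJob are hypothetical stand-ins written for illustration, not Hadoop classes, and the sketch simply assumes (as the answer says) that the job takes its own copy of the configuration in its constructor.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfCopyDemo {

    // Hypothetical stand-in for Configuration: a plain key/value store.
    static class MiniConf {
        private final Map<String, String> props = new HashMap<>();

        MiniConf() {}

        // Copy constructor: snapshots the entries present at this moment.
        MiniConf(MiniConf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) { props.put(key, value); }

        String get(String key) { return props.get(key); }
    }

    // Hypothetical stand-in for Job: it keeps a private copy of the conf.
    static class MiniJob {
        private final MiniConf conf;

        MiniJob(MiniConf conf) { this.conf = new MiniConf(conf); }

        MiniConf getConfiguration() { return conf; }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.set("before", "1");                    // set before the job exists

        MiniJob job = new MiniJob(conf);            // the job copies conf here
        conf.set("after", "2");                     // too late: only the original sees it
        job.getConfiguration().set("viaJob", "3");  // goes to the job's own copy

        System.out.println(job.getConfiguration().get("before")); // 1
        System.out.println(job.getConfiguration().get("after"));  // null
        System.out.println(job.getConfiguration().get("viaJob")); // 3
    }
}
```

In this model a setting reaches the job only if it is made before the job is constructed, or made afterwards through job.getConfiguration().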
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
You can also do it this way.
This version of the code (which is slightly different from the constructs mentioned above) has always worked for me.
//in main(String [] args)
Job job = new Job(conf,"Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in the Mapper code:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(
            new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read data
}
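For the readLine() step, one possible approach is to load the file into an in-memory set once, so each map() call can do a cheap lookup. This is only a sketch in plain Java: CacheFileLoader and loadLines are hypothetical names, it assumes one entry per line, and it uses java.nio.file.Path for a local file rather than Hadoop's Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CacheFileLoader {

    // Reads every non-blank line of a local file into a set for fast lookups.
    // Assumes one entry per line; surrounding whitespace is trimmed.
    static Set<String> loadLines(Path localFile) throws IOException {
        Set<String> lines = new HashSet<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader =
                 Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for the locally cached copy.
        Path tmp = Files.createTempFile("testCache", ".txt");
        Files.write(tmp, Arrays.asList("alpha", "", "beta"), StandardCharsets.UTF_8);

        Set<String> lookup = loadLines(tmp);
        System.out.println(lookup.contains("beta")); // true

        Files.delete(tmp);
    }
}
```

Inside a Mapper, the same loop could run over the BufferedReader opened in setup(), with the resulting set stored in a field for map() to use.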
Once the Job has been created from a Configuration object, i.e.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
any later changes made to conf, e.g.
conf.set("delimiter", "|");
or
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
are not reflected on a pseudo-distributed or fully distributed cluster, although they do work in the local environment.