distributed-cache

Files not put correctly into distributed cache

只愿长相守 submitted on 2019-12-03 00:45:57
I am adding a file to the distributed cache using the following code:

    Configuration conf2 = new Configuration();
    job = new Job(conf2);
    job.setJobName("Join with Cache");
    DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

Then I read the file in the mappers:

    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        URI[] cacheFile = DistributedCache.getCacheFiles(conf);
        FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
        BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
        ...
    }
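A likely culprit, assuming the classic Job API shown above: new Job(conf2) takes a private copy of the configuration, so a DistributedCache.addCacheFile(...) call made on conf2 afterwards never reaches the job that actually gets submitted, and getCacheFiles() then returns null in the mappers. A minimal sketch of the fix is to register the file before constructing the Job:

    // Register the cache file BEFORE the Job copies the configuration
    // (or, equivalently, call addCacheFile on job.getConfiguration()).
    Configuration conf2 = new Configuration();
    DistributedCache.addCacheFile(
            new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
    Job job = new Job(conf2);
    job.setJobName("Join with Cache");

On newer Hadoop releases the same thing is spelled job.addCacheFile(uri), which sidesteps the ordering problem entirely.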

Confusion about distributed cache in Hadoop

扶醉桌前 submitted on 2019-12-01 16:14:19
What does the distributed cache actually mean? Does having a file in the distributed cache mean that it is available on every datanode, so there is no internode communication for that data, or does it mean that the file is held in memory on every node? If not, by what means can I have a file in memory for the entire job? Can this be done both for map-reduce and for a UDF? (In particular, there is some comparatively small configuration data that I would like to keep in memory while a UDF runs over a Hive query.) Thanks and regards, Dhruv Kapur. DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by applications: the files are copied to each task node's local disk before any tasks execute, not loaded into memory.
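If in-memory access for the whole job is the goal, the usual pattern is to read the (small) cached file once in setup() and keep it in a field. A minimal sketch, assuming a hypothetical tab-separated lookup file has already been registered in the cache:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: loads a small cached lookup file into a HashMap
    // once per task, so map() can do pure in-memory lookups afterwards.
    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
            Path path = new Path(cacheFiles[0].getPath());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(FileSystem.get(conf).open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);  // assumed key<TAB>value layout
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
        // map() omitted; it can now consult `lookup` without any I/O.
    }

The same idea carries over to a Hive UDF: parse the file once during the UDF's initialization and keep the resulting structure in memory for the lifetime of the task.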

Re-use files in Hadoop Distributed cache

蹲街弑〆低调 submitted on 2019-11-30 16:16:13
I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node grows in size. Is there a way for multiple jobs to re-use the same file in the distributed cache, or is the distributed cache only valid for the lifetime of an individual job? The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", which leads me to believe that if the timestamp hasn't changed, the file should be re-used rather than copied to the node again.
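That reading is essentially correct: the localized copies outlive the job, and a later job that registers the same HDFS URI is served from a node's existing copy as long as the file's modification timestamp is unchanged. The per-node cache is only pruned once it crosses a size threshold (local.cache.size in classic MapReduce, yarn.nodemanager.localizer.cache.target-size-mb under YARN; roughly 10 GB by default), which is why the folder appears to grow across runs. A minimal sketch of two jobs sharing one cached file, with the path being an assumption:

    // Both jobs register the identical HDFS URI. If /shared/lookup.dat is
    // not modified between submissions, the second job's tasks reuse the
    // copy already localized on each node instead of re-downloading it.
    URI shared = new URI("hdfs://server:port/shared/lookup.dat");  // hypothetical path

    Job first = Job.getInstance(conf, "pass 1");
    first.addCacheFile(shared);
    first.waitForCompletion(true);

    Job second = Job.getInstance(conf, "pass 2");
    second.addCacheFile(shared);
    second.waitForCompletion(true);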

Hadoop - Large files in distributed cache

南楼画角 submitted on 2019-11-28 14:23:19
I have a 4 GB file that I am trying to share across all mappers through the distributed cache, but I am observing a significant delay before the map task attempts start. Specifically, there is a long gap between the time I submit my job (through job.waitForCompletion()) and the time the first map starts. I would like to know what the side effects of having large files in a DistributedCache are. How many times is a file in the distributed cache replicated? Does the number of nodes in a cluster have any effect on this? (My cluster has about 13 nodes running on very powerful machines, where each …)
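The delay is expected: before any task can run, every node that will execute a mapper has to localize the file, i.e. pull its own 4 GB copy out of HDFS onto local disk, and with the default HDFS replication factor of 3 all of those reads compete for the same three datanodes. The framework does not re-replicate the cache file itself, so one common mitigation is to raise the replication factor of just that file before submitting. A hedged sketch, with the path as an assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Spread the blocks of the large cache file across more datanodes so
    // the 13 nodes' localization reads don't all hit the same 3 replicas.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path bigCacheFile = new Path("/FilePath/part-r-00000");  // hypothetical path
    fs.setReplication(bigCacheFile, (short) 10);             // HDFS default is 3

(This mirrors what the framework already does for the job jar itself via mapred.submit.replication, which defaults to 10.)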
