distributed-cache

Files not put correctly into distributed cache

只愿长相守 submitted on 2019-12-03 00:45:57
I am adding a file to the distributed cache using the following code:

    Configuration conf2 = new Configuration();
    job = new Job(conf2);
    job.setJobName("Join with Cache");
    DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

Then I read the file in the mappers:

    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        URI[] cacheFile = DistributedCache.getCacheFiles(conf);
        FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
        BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
        ...
    }
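A likely culprit, assuming the classic Job API shown above: new Job(conf2) takes a private copy of the configuration, so a DistributedCache.addCacheFile(...) call made on conf2 afterwards never reaches the job that actually gets submitted, and getCacheFiles() then returns null in the mappers. A minimal sketch of the fix is to register the file before constructing the Job:

    // Register the cache file BEFORE the Job copies the configuration
    // (or, equivalently, call addCacheFile on job.getConfiguration()).
    Configuration conf2 = new Configuration();
    DistributedCache.addCacheFile(
            new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
    Job job = new Job(conf2);
    job.setJobName("Join with Cache");

On newer Hadoop releases the same thing is spelled job.addCacheFile(uri), which sidesteps the ordering problem entirely.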

Confusion about distributed cache in Hadoop

扶醉桌前 submitted on 2019-12-01 16:14:19
What does the distributed cache actually mean? Does having a file in the distributed cache mean that it is available on every datanode, so there is no internode communication for that data, or does it mean that the file is held in memory on every node? If not, by what means can I have a file in memory for the entire job? Can this be done both for map-reduce and for a UDF? (In particular, there is some comparatively small configuration data that I would like to keep in memory while a UDF runs over a Hive query.) Thanks and regards, Dhruv Kapur. DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by applications: the files are copied to each task node's local disk before any tasks execute, not loaded into memory.
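If in-memory access for the whole job is the goal, the usual pattern is to read the (small) cached file once in setup() and keep it in a field. A minimal sketch, assuming a hypothetical tab-separated lookup file has already been registered in the cache:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: loads a small cached lookup file into a HashMap
    // once per task, so map() can do pure in-memory lookups afterwards.
    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
            Path path = new Path(cacheFiles[0].getPath());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(FileSystem.get(conf).open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);  // assumed key<TAB>value layout
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
        // map() omitted; it can now consult `lookup` without any I/O.
    }

The same idea carries over to a Hive UDF: parse the file once during the UDF's initialization and keep the resulting structure in memory for the lifetime of the task.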

Re-use files in Hadoop Distributed cache

蹲街弑〆低调 submitted on 2019-11-30 16:16:13
I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node grows in size. Is there a way for multiple jobs to re-use the same file in the distributed cache, or is the distributed cache only valid for the lifetime of an individual job? The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", which leads me to believe that if the timestamp hasn't changed, the file should be re-used rather than copied to the node again.
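That reading is essentially correct: the localized copies outlive the job, and a later job that registers the same HDFS URI is served from a node's existing copy as long as the file's modification timestamp is unchanged. The per-node cache is only pruned once it crosses a size threshold (local.cache.size in classic MapReduce, yarn.nodemanager.localizer.cache.target-size-mb under YARN; roughly 10 GB by default), which is why the folder appears to grow across runs. A minimal sketch of two jobs sharing one cached file, with the path being an assumption:

    // Both jobs register the identical HDFS URI. If /shared/lookup.dat is
    // not modified between submissions, the second job's tasks reuse the
    // copy already localized on each node instead of re-downloading it.
    URI shared = new URI("hdfs://server:port/shared/lookup.dat");  // hypothetical path

    Job first = Job.getInstance(conf, "pass 1");
    first.addCacheFile(shared);
    first.waitForCompletion(true);

    Job second = Job.getInstance(conf, "pass 2");
    second.addCacheFile(shared);
    second.waitForCompletion(true);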

Hadoop - Large files in distributed cache

南楼画角 submitted on 2019-11-28 14:23:19
I have a 4 GB file that I am trying to share across all mappers through the distributed cache, but I am observing a significant delay before the map task attempts start. Specifically, there is a long gap between the time I submit my job (through job.waitForCompletion()) and the time the first map starts. I would like to know what the side effects of having large files in a DistributedCache are. How many times is a file in the distributed cache replicated? Does the number of nodes in a cluster have any effect on this? (My cluster has about 13 nodes running on very powerful machines, where each …)
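The delay is expected: before any task can run, every node that will execute a mapper has to localize the file, i.e. pull its own 4 GB copy out of HDFS onto local disk, and with the default HDFS replication factor of 3 all of those reads compete for the same three datanodes. The framework does not re-replicate the cache file itself, so one common mitigation is to raise the replication factor of just that file before submitting. A hedged sketch, with the path as an assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Spread the blocks of the large cache file across more datanodes so
    // the 13 nodes' localization reads don't all hit the same 3 replicas.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path bigCacheFile = new Path("/FilePath/part-r-00000");  // hypothetical path
    fs.setReplication(bigCacheFile, (short) 10);             // HDFS default is 3

(This mirrors what the framework already does for the job jar itself via mapred.submit.replication, which defaults to 10.)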
