Hadoop - Large files in distributed cache


"Cache" in this case is a bit misleading. Your 4 GB file will be distributed to every task along with the jars and configuration.

For files larger than 200 MB I usually put them directly into HDFS and set the replication factor higher than the default (in your case I would set it to 5-7). You can then read the file directly from the distributed filesystem in every task using the usual FileSystem API, like:

FileSystem fs = FileSystem.get(config); // config is the job's Configuration
FSDataInputStream in = fs.open(new Path("/path/to/the/larger/file"));
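Raising the replication factor of the already-uploaded file can also be done from code; a minimal sketch, again assuming config is the job's Configuration and using the factor 7 from the range suggested above:

FileSystem fs = FileSystem.get(config);
// Higher replication means more tasks find a data-local or nearby replica
fs.setReplication(new Path("/path/to/the/larger/file"), (short) 7);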

This saves space on the cluster and should not delay task startup. However, when a read is not data-local, HDFS has to stream the data to the task over the network, which can consume a considerable amount of bandwidth.
