I have a 4 GB file that I am trying to share across all mappers through a distributed cache, but I am observing a significant delay before the map task attempts start. Specifically, there is a long gap between the time I submit my job (through job.waitForCompletion()) and the time the first map task starts.
I would like to know what the side effects of having large files in the DistributedCache are. How many times is a file in the distributed cache replicated? Does the number of nodes in the cluster have any effect on this?
(My cluster has about 13 nodes running on very powerful machines, each of which can host close to 10 map slots.)
Thanks
"Cache" in this case is a bit misleading. Your 4 GB file will be distributed to every task along with the jars and configuration.
For files larger than 200 MB I usually put them directly into the filesystem and set their replication to a higher value than the usual replication factor (in your case I would set this to 5-7). You can read directly from the distributed filesystem in every task using the usual FS API calls, for example:
// config is the task's Configuration (e.g. from context.getConfiguration())
FileSystem fs = FileSystem.get(config);
FSDataInputStream in = fs.open(new Path("/path/to/the/larger/file"));
This saves space in the cluster and should also not delay the task start. However, if a read is not data-local, HDFS has to stream the data over the network to the task, which might use a considerable amount of bandwidth.
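For the upload side, here is a minimal sketch of the approach described above (copying the file into HDFS and raising its replication factor); the local path, target path, and replication factor of 6 are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadLargeFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy the large file from the local filesystem into HDFS.
        Path target = new Path("/path/to/the/larger/file");
        fs.copyFromLocalFile(new Path("/local/path/to/large-file"), target);

        // Raise the replication factor so more nodes hold a copy
        // and map tasks are more likely to read it data-locally.
        fs.setReplication(target, (short) 6);

        fs.close();
    }
}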
Source: https://stackoverflow.com/questions/17291344/hadoop-large-files-in-distributed-cache