Hadoop - Large files in distributed cache


"Cache" in this case is a bit misleading. Your 4 GB file will be distributed to every task along with the jars and configuration.

For files larger than 200 MB I usually put them directly into HDFS and set the replication factor higher than the default (in your case I would set it to 5-7). You can then read the file directly from the distributed filesystem in every task using the usual FileSystem API, like:

FileSystem fs = FileSystem.get(config); // config is the job's Configuration
FSDataInputStream in = fs.open(new Path("/path/to/the/larger/file"));
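Raising the replication factor of the already-uploaded file can also be done from code; a minimal sketch, again assuming config is the job's Configuration and using the factor 7 from the range suggested above:

FileSystem fs = FileSystem.get(config);
// Higher replication means more tasks find a data-local or nearby replica
fs.setReplication(new Path("/path/to/the/larger/file"), (short) 7);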

This saves space on the cluster and should not delay task startup. However, when a read is not data-local, HDFS has to stream the data to the task over the network, which can consume a considerable amount of bandwidth.
