Assume there is a Spark job that reads a file named records.txt from HDFS, performs some transformations, and runs one action (writing the processed output back into HDFS).
The fundamental question here is:
Does YARN know about data locality?
YARN "knows" only what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.
If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
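To make the fallback concrete, the matching can be modeled roughly like this. This is a simplified sketch, not YARN's actual scheduler code (real schedulers use delay scheduling and relaxed-locality requests); the class, method, and node names are invented for the example:

```java
import java.util.*;

// Toy model of locality-aware container placement: prefer the requested
// node, fall back to another node on the same rack, then to any free node.
public class LocalityMatch {
    // cluster topology known to the scheduler: node -> rack
    static final Map<String, String> RACK_OF = Map.of(
        "node1", "rack1", "node2", "rack1", "node3", "rack2");

    static String allocate(String requestedNode, SortedSet<String> freeNodes) {
        if (freeNodes.contains(requestedNode)) return requestedNode; // node-local
        String wantedRack = RACK_OF.get(requestedNode);
        for (String n : freeNodes)                                   // rack-local
            if (RACK_OF.get(n).equals(wantedRack)) return n;
        return freeNodes.first();                                    // off-rack (ANY)
    }

    public static void main(String[] args) {
        // node1 is busy, so the best alternative is node2 (same rack as node1)
        System.out.println(allocate("node1", new TreeSet<>(Set.of("node2", "node3"))));
    }
}
```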
So how does the application "know"?
If the application uses an input source (a file system or otherwise) that supports some form of data locality, it can query the corresponding catalog (the namenode, in the case of HDFS) to get the locations of the blocks of data it wants to access.
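For HDFS this is what Hadoop's FileSystem.getFileBlockLocations (or, from the command line, hdfs fsck with the -files -blocks -locations flags) returns. Since that call needs a live cluster, here is an in-memory stand-in that only illustrates the shape of the answer; the catalog contents and host names are invented:

```java
import java.util.*;

// Toy "namenode" catalog: maps each block of a file to the hosts that
// store a replica of it. A real application would ask the namenode via
// FileSystem.getFileBlockLocations instead of this in-memory map.
public class BlockCatalog {
    // file -> ordered list of blocks, each block -> its replica hosts
    static final Map<String, List<List<String>>> CATALOG = Map.of(
        "records.txt", List.of(
            List.of("node1", "node2", "node3"),   // block 0 replicas
            List.of("node2", "node4", "node5"))); // block 1 replicas

    static List<List<String>> blockLocations(String file) {
        return CATALOG.getOrDefault(file, List.of());
    }

    public static void main(String[] args) {
        // the application can now try to place each task near a replica
        List<List<String>> blocks = blockLocations("records.txt");
        for (int i = 0; i < blocks.size(); i++)
            System.out.println("block " + i + " -> " + blocks.get(i));
    }
}
```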
In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).