Assume there is a Spark job that reads a file named records.txt from HDFS, performs some transformations, and runs one action (writing the processed output back into HDFS).
The fundamental question here is:
Does YARN know about data locality?
YARN "knows" only what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.
If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
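To make the fallback concrete, the matching can be modeled roughly like this. This is a simplified sketch, not YARN's actual scheduler code (real schedulers use delay scheduling and relaxed-locality requests); the class, method, and node names are invented for the example:

```java
import java.util.*;

// Toy model of locality-aware container placement: prefer the requested
// node, fall back to another node on the same rack, then to any free node.
public class LocalityMatch {
    // cluster topology known to the scheduler: node -> rack
    static final Map<String, String> RACK_OF = Map.of(
        "node1", "rack1", "node2", "rack1", "node3", "rack2");

    static String allocate(String requestedNode, SortedSet<String> freeNodes) {
        if (freeNodes.contains(requestedNode)) return requestedNode; // node-local
        String wantedRack = RACK_OF.get(requestedNode);
        for (String n : freeNodes)                                   // rack-local
            if (RACK_OF.get(n).equals(wantedRack)) return n;
        return freeNodes.first();                                    // off-rack (ANY)
    }

    public static void main(String[] args) {
        // node1 is busy, so the best alternative is node2 (same rack as node1)
        System.out.println(allocate("node1", new TreeSet<>(Set.of("node2", "node3"))));
    }
}
```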
So how does the application "know"?
If the application uses an input source (a file system or otherwise) that supports some form of data locality, it can query the corresponding catalog (the namenode, in the case of HDFS) to get the locations of the blocks of data it wants to access.
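For HDFS this is what Hadoop's FileSystem.getFileBlockLocations (or, from the command line, hdfs fsck with the -files -blocks -locations flags) returns. Since that call needs a live cluster, here is an in-memory stand-in that only illustrates the shape of the answer; the catalog contents and host names are invented:

```java
import java.util.*;

// Toy "namenode" catalog: maps each block of a file to the hosts that
// store a replica of it. A real application would ask the namenode via
// FileSystem.getFileBlockLocations instead of this in-memory map.
public class BlockCatalog {
    // file -> ordered list of blocks, each block -> its replica hosts
    static final Map<String, List<List<String>>> CATALOG = Map.of(
        "records.txt", List.of(
            List.of("node1", "node2", "node3"),   // block 0 replicas
            List.of("node2", "node4", "node5"))); // block 1 replicas

    static List<List<String>> blockLocations(String file) {
        return CATALOG.getOrDefault(file, List.of());
    }

    public static void main(String[] args) {
        // the application can now try to place each task near a replica
        List<List<String>> blocks = blockLocations("records.txt");
        for (int i = 0; i < blocks.size(); i++)
            System.out.println("block " + i + " -> " + blocks.get(i));
    }
}
```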
In a broader sense, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).