Apache Spark has recently updated the version to 0.8.1, in which yarn-client
mode is available. My question is, what does yarn-client mode really mean? In the d
Both spark and yarn are distributed framework , but their roles are different:
Yarn is a resource management framework, for each application, it has following roles:
ApplicationMaster: resource management of a single application, including ask for/release resource from Yarn for the application and monitor.
Attempt: an attempt is just a normal process which does part of the whole job of the application. For example , a mapreduce job which consists of multiple mappers and reducers , each mapper and reducer is an Attempt.
A common process of summiting a application to yarn is:
The client submit the application request to yarn. In the
request, Yarn should know the ApplicationMaster class; For
SparkApplication, it is
org.apache.spark.deploy.yarn.ApplicationMaster
,for MapReduce job ,
it is org.apache.hadoop.mapreduce.v2.app.MRAppMaster
.
Yarn allocate some resource for the ApplicationMaster process and start the ApplicationMaster process in one of the cluster nodes;
After ApplicationMaster starts, ApplicationMaster will request resource from Yarn for this Application and start up worker;
For Spark, the distributed computing framework, a computing job is divided into many small tasks and each Executor will be responsible for each task, the Driver will collect the result of all Executor tasks and get a global result. A spark application has only one driver with multiple executors.
So, then ,the problem comes when Spark is using Yarn as a resource management tool in a cluster:
In Yarn Cluster Mode, Spark client will submit spark application to yarn, both Spark Driver and Spark Executor are under the supervision of yarn. In yarn's perspective, Spark Driver and Spark Executor have no difference, but normal java processes, namely an application worker process. So, when the client process is gone , e.g. the client process is terminated or killed, the Spark Application on yarn is still running.
In yarn client mode, only the Spark Executor are under the
supervision of yarn. The Yarn ApplicationMaster will request resource
for just spark executor. The driver program is running in the client
process which have nothing to do with yarn, just a process submitting
application to yarn.So ,when the client leave, e.g. the client
process exits, the Driver is down and the computing terminated.
So in spark you have two different components. There is the driver and the workers. In yarn-cluster mode the driver is running remotely on a data node and the workers are running on separate data nodes. In yarn-client mode the driver is on the machine that started the job and the workers are on the data nodes. In local mode the driver and workers are on the machine that started the job.
When you run .collect() the data from the worker nodes get pulled into the driver. It's basically where the final bit of processing happens.
For my self i have found yarn-cluster mode to be better when i'm at home on the vpn, but yarn-client mode is better when i'm running code from within the data center.
Yarn-client mode also means you tie up one less worker node for the driver.
A Spark application running in
driver program runs in client machine or local machine where the application has been launched.
Resource allocation is done by YARN resource manager based on data locality on data nodes and driver program from local machine will control the executors on spark cluster (Node managers).
Please refer this cloudera article for more info.
The difference between standalone mode and yarn deployment mode,
With yarn-client mode, your spark application is running in your local machine. With yarn-standalone mode, your spark application would be submitted to YARN's ResourceManager as yarn ApplicationMaster, and your application is running in a yarn node where ApplicationMaster is running. In both case, yarn serve as spark's cluster manager. Your application(SparkContext) send tasks to yarn.
A Spark application consists of a driver and one or many executors. The driver program is the main program (where you instantiate SparkContext
), which coordinates the executors to run the Spark application. The executors run tasks assigned by the driver.
A YARN application has the following roles: yarn client, yarn application master and list of containers running on the node managers.
When Spark application runs on YARN, it has its own implementation of yarn client and yarn application master.
With those background, the major difference is where the driver program runs.
Reference: http://spark.incubator.apache.org/docs/latest/cluster-overview.html
First of all, let's make clear what's the difference between running Spark in standalone mode and running Spark on a cluster manager (Mesos or YARN).
When running Spark in standalone mode, you have:
So:
When using a cluster manager (I will describe for YARN which is the most common case), you have :
Note that there are 2 modes in that case: cluster-mode
and client-mode
. In the client mode, which is the one you mentioned:
So, back to your questions:
What does it mean "launched locally"? Locally where? On the Spark cluster?
Locally means in the server in which you are executing the command (which could be a spark-submit
or a spark-shell
). That means that you could possibly run it in the cluster's master node or you could also run it in a server outside the cluster (e.g. your laptop) as long as the appropriate configuration is in place, so that this server can communicate with the cluster and vice-versa.
What is the specific difference from the yarn-standalone mode?
As described above, the difference is that in the standalone mode, there is no cluster manager at all. A more elaborate analysis and categorisation of all the differences concretely for each mode is available in this article.