I am a newbie with Spark and PySpark. I would appreciate it if somebody could explain what exactly the SparkContext parameter does, and how I could set it.
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
If you are running the pyspark shell, then Spark automatically creates the SparkContext object for you with the name sc. But if you are writing your own Python program, you have to do something like:
from pyspark import SparkContext
sc = SparkContext(appName="test")
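Once you have sc, you can use it to build RDDs and run jobs; a minimal sketch (the sample data is purely illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="test")
rdd = sc.parallelize([1, 2, 3, 4])  # distribute a small Python list as an RDD
print(rdd.sum())                    # 10
sc.stop()                           # shut the context down when finished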
Any configuration would go into this SparkContext object, like setting the executor memory or the number of cores.
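For example, here is a sketch of setting those values through a SparkConf before the context is created (the values are only illustrative, and spark.executor.instances applies when running on YARN):

from pyspark import SparkConf, SparkContext

# Build a SparkConf with the desired settings and pass it to SparkContext.
conf = (SparkConf()
        .setAppName("test")
        .set("spark.executor.memory", "2g")     # memory per executor
        .set("spark.executor.cores", "1")       # cores per executor
        .set("spark.executor.instances", "3"))  # number of executors (YARN)
sc = SparkContext(conf=conf)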
These parameters can also be passed from the shell when invoking the job, for example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    lib/spark-examples*.jar \
    10
To pass parameters to the pyspark shell, use something like this:
./bin/pyspark --num-executors 17 --executor-cores 5 --executor-memory 8G
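Inside the shell started this way, sc already exists, so you can verify that the settings took effect; a small sketch (the printed values correspond to the command above):

# sc is created automatically by the pyspark shell.
print(sc.getConf().get("spark.executor.memory"))  # '8G'
print(sc.getConf().get("spark.executor.cores"))   # '5'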