Question
I'm new to PySpark and I'm trying to use pySpark (ver 2.3.1) on my local computer with Jupyter-Notebook.
I want to set spark.driver.memory to 9Gb by doing this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("test") \
    .config("spark.driver.memory", "9g") \
    .getOrCreate()
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark.sparkContext._conf.getAll() # check the config
It returns
[('spark.driver.memory', '9g'),
('spark.driver.cores', '4'),
('spark.rdd.compress', 'True'),
('spark.driver.port', '15611'),
('spark.serializer.objectStreamReset', '100'),
('spark.app.name', 'test'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.master', 'local[2]'),
('spark.app.id', 'local-xyz'),
('spark.driver.host', '0.0.0.0')]
It's quite weird, because when I look at the documentation, it says:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. (documentation here)
But, as you see in the result above, it returns
[('spark.driver.memory', '9g')
Even when I check the Spark web UI (on port 4040, Environment tab), it still shows '9g'.
I tried one more time with 'spark.driver.memory' set to '10g'. Both the web UI and spark.sparkContext._conf.getAll() returned '10g'.
I'm so confused about that.
My questions are:
1. Is the documentation right about the spark.driver.memory config?
2. If the documentation is right, is there a proper way to check spark.driver.memory after configuring it? I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but they seem to lead to a wrong answer.
Answer 1:
You provided the following code.
spark = (SparkSession.builder
    .master("local[2]")
    .appName("test")
    .config("spark.driver.memory", "9g")  # This will work (not recommended)
    .getOrCreate())
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
This config must not be set through the SparkConf directly
means you can set the driver memory, but it is not recommended at RUN TIME. Hence, if you set it using spark.driver.memory, Spark accepts the change and overrides the value, but this is not recommended. So that particular comment, "this config must not be set through the SparkConf directly", does not apply here. You can tell the JVM to instantiate itself with 9g of driver memory by using SparkConf.
Now, if you go by this line (Spark is fine with this):
Instead, please set this through the --driver-memory command line option
it implies that when you are submitting a Spark job in client mode, you can set the driver memory by using the --driver-memory flag, say
spark-submit --deploy-mode client --driver-memory 12G
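For example, a complete invocation might look like the following sketch (my_app.py is only a placeholder for your application script, not something from the question):
spark-submit \
  --master local[2] \
  --deploy-mode client \
  --driver-memory 12G \
  my_app.py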
Now, the line ends with the following phrase:
or in your default properties file.
You can tell SPARK in your environment to read the default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where the driver memory can be configured. Spark is also fine with this.
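For instance, the default properties file under that directory is conf/spark-defaults.conf; a minimal sketch of it (the 9g value mirrors this question, the rest is purely illustrative) would be:
spark.master           local[2]
spark.driver.memory    9g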
To answer the second part of your question:
If the document is right, is there a proper way that I can check spark.driver.memory after config. I tried spark.sparkContext._conf.getAll() as well as Spark web UI but it seems to lead to a wrong answer."
I would like to say that the documentation is right. You can check the driver memory by using sc._conf.get('spark.driver.memory'); what you have already tried, spark.sparkContext._conf.getAll(), works too.
>>> sc._conf.get('spark.driver.memory')
u'12g' # which is 12G for the driver I have used
To conclude about the documentation: you can set spark.driver.memory in
1. the spark-shell, Jupyter Notebook or any other environment where you have already initialized Spark (Not Recommended),
2. the spark-submit command (Recommended),
3. SPARK_CONF_DIR or SPARK_HOME/conf (Recommended).
You can start spark-shell by specifying
spark-shell --driver-memory 9G
For more information refer to
Default Spark Properties File
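Since the question runs PySpark from Jupyter-Notebook, one commonly used alternative (a sketch, assuming PySpark picks up PYSPARK_SUBMIT_ARGS when it launches the driver JVM from the notebook process) is to export the flag before starting the kernel:
export PYSPARK_SUBMIT_ARGS="--driver-memory 9g pyspark-shell"
Because the flag is applied before the JVM starts, the value reported by the web UI then matches the memory actually allocated.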
Answer 2:
Yes, the documentation is correct. The memory needs to be specified before the JVM starts. After the JVM starts, even if you change the value of the property programmatically inside the application, it won't reset the memory allocated by the JVM. You can verify the driver memory allocated and used from the "Executors" tab of the Spark UI.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point.
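If you want to check from code how much heap the driver JVM actually got (rather than the value the overridden config reports), one possible sketch is to ask the JVM Runtime through PySpark's internal py4j handle; note that _jvm is an internal attribute, not a public API:
# Max heap the driver JVM was started with, in bytes (roughly the effective -Xmx).
runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
print(runtime.maxMemory())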
Answer 3:
Setting spark.driver.memory through SparkSession.builder.config only works if the driver JVM hasn't been started before.
To prove it, first run the following code against a fresh Python interpreter:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.range(10000000).collect()
The code throws java.lang.OutOfMemoryError: GC overhead limit exceeded, as 10M rows won't fit into a 512m driver. However, if you try that with 2g of memory (again, with a fresh Python interpreter):
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()
the code works just fine. Now, you'd expect this:
spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.stop() # to set new configs, you must first stop the running session
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()
to run without errors, as your session's spark.driver.memory is seemingly set to 2g. However, you get java.lang.OutOfMemoryError: GC overhead limit exceeded, which means your driver memory is still 512m! The driver memory wasn't updated because the driver JVM was already started when it received the new config. Interestingly, if you read Spark's config with spark.sparkContext.getConf().getAll() (or from the Spark UI), it tells you your driver memory is 2g, which is obviously not true.
Thus the official Spark documentation (https://spark.apache.org/docs/2.4.5/configuration.html#application-properties) is right when it says you should set driver memory through the --driver-memory command line option or in your default properties file.
Source: https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1