How to set `spark.driver.memory` in client mode - pyspark (version 2.3.1)

[亡魂溺海] Submitted on 2020-08-19 05:33:05

Question


I'm new to PySpark and I'm trying to use PySpark (version 2.3.1) on my local computer with a Jupyter Notebook.

I want to set spark.driver.memory to 9 GB by doing this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
       .master("local[2]") \
       .appName("test") \
       .config("spark.driver.memory", "9g") \
       .getOrCreate()
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

spark.sparkContext._conf.getAll()  # check the config

It returns

[('spark.driver.memory', '9g'),
('spark.driver.cores', '4'),
('spark.rdd.compress', 'True'),
('spark.driver.port', '15611'),
('spark.serializer.objectStreamReset', '100'),
('spark.app.name', 'test'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.master', 'local[2]'),
('spark.app.id', 'local-xyz'),
('spark.driver.host', '0.0.0.0')]

This is quite weird, because when I look at the documentation, it says:

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. (documentation here)

But, as you see in the result above, it returns

[('spark.driver.memory', '9g')

Even when I open the Spark web UI (on port 4040, Environment tab), it still shows 9g for spark.driver.memory.

I tried once more, with 'spark.driver.memory' set to '10g'. The web UI and spark.sparkContext._conf.getAll() both returned '10g'. I'm confused about this. My questions are:

  1. Is the documentation right about the spark.driver.memory config?

  2. If the documentation is right, is there a proper way to check spark.driver.memory after configuring it? I tried spark.sparkContext._conf.getAll() as well as the Spark web UI, but they seem to give a wrong answer.


Answer 1:


You provided the following code.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
       .master("local[2]") \
       .appName("test") \
       .config("spark.driver.memory", "9g") \
       .getOrCreate()
# the .config("spark.driver.memory", "9g") call above will work (not recommended)
sc = spark.sparkContext
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

This config must not be set through the SparkConf directly

means you can set the driver memory this way, but it is not recommended at run time. Hence, if you set it using spark.driver.memory, Spark accepts the change and overrides the default value. But this is not recommended. So that particular note, "this config must not be set through the SparkConf directly", is a recommendation in the documentation rather than a hard restriction. You can tell the JVM to instantiate itself with 9g of driver memory by using SparkConf.

Now, if you go by the next part of that note (Spark is fine with this):

Instead, please set this through the --driver-memory command line option

it implies that

when you submit a Spark job in client mode, you can set the driver memory with the --driver-memory flag, say

spark-submit --deploy-mode client --driver-memory 12G
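
For instance, a complete command might look like the following (my_app.py is just a placeholder for whatever application you actually submit):

spark-submit --master local[2] --deploy-mode client --driver-memory 12G my_app.py

Here the driver JVM is created by spark-submit itself, so the requested memory is applied before any application code runs.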

The line ends with the following phrase:

or in your default properties file.

You can tell Spark in your environment to read the default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where the driver memory can be configured. Spark is also fine with this.
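
As a sketch, a minimal spark-defaults.conf entry for this would look like the following (assuming the stock conf directory; the value is only an example):

# $SPARK_HOME/conf/spark-defaults.conf (or the directory SPARK_CONF_DIR points to)
spark.driver.memory              9g

Any driver started through the usual launch scripts will then pick this value up, unless it is overridden on the command line.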

To answer the second part of your question:

If the document is right, is there a proper way that I can check spark.driver.memory after config. I tried spark.sparkContext._conf.getAll() as well as Spark web UI but it seems to lead to a wrong answer."

I would like to say that the documentation is right. You can check the driver memory as shown below; what you already used, spark.sparkContext._conf.getAll(), works too.

>>> sc._conf.get('spark.driver.memory')
u'12g' # which is 12G for the driver I have used

To conclude about the documentation: you can set spark.driver.memory in the

  • spark-shell, Jupyter Notebook or any other environment where you have already initialized Spark (not recommended; see the notebook sketch after this list).
  • spark-submit command (Recommended)
  • SPARK_CONF_DIR or SPARK_HOME/conf (Recommended)
  • You can start spark-shell by specifying

    spark-shell --driver-memory 9G
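
Since the question runs Spark from a Jupyter Notebook, here is a minimal sketch of how the recommended "set it before the JVM starts" approach can look in a notebook. It assumes the kernel launches the driver JVM through the stock pyspark gateway, which reads the PYSPARK_SUBMIT_ARGS environment variable; the memory value is only an example.

import os

# Must run before the first SparkSession/SparkContext is created,
# i.e. before the driver JVM is launched. The trailing "pyspark-shell"
# token is required by the pyspark launcher.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 9g pyspark-shell"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("test").getOrCreate()
print(spark.sparkContext._conf.get("spark.driver.memory"))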

For more information, refer to:

Default Spark Properties File




Answer 2:


Yes, the documentation is correct. The memory needs to be specified before the JVM starts. Once the JVM has started, even if you change the value of the property programmatically inside the application, it won't reset the memory already allocated by the JVM. You can verify the driver memory allocated and used from the Spark UI's "Executors" tab.

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point.




Answer 3:


Setting spark.driver.memory through SparkSession.builder.config only works if the driver JVM hasn't been started before.

To prove it, first run the following code against a fresh Python interpreter:

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.range(10000000).collect()

The code throws java.lang.OutOfMemoryError: GC overhead limit exceeded, as 10M rows won't fit into a 512m driver. However, if you try that with 2g of memory (again, in a fresh Python interpreter):

spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()

the code works just fine. Now, you'd expect this:

spark = SparkSession.builder.config("spark.driver.memory", "512m").getOrCreate()
spark.stop()  # to set new configs, you must first stop the running session 
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()
spark.range(10000000).collect()

to run without errors, since your session's spark.driver.memory is seemingly set to 2g. However, you get java.lang.OutOfMemoryError: GC overhead limit exceeded, which means your driver memory is still 512m! The driver memory wasn't updated, because the driver JVM had already started when it received the new config. Interestingly, if you read Spark's config with spark.sparkContext.getConf().getAll() (or from the Spark UI), it tells you your driver memory is 2g, which is obviously not true.
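
A quick way to see what the driver JVM actually received, rather than what the conf claims, is to ask the JVM itself for its maximum heap. The snippet below goes through PySpark's private _jvm py4j gateway, so treat it as a diagnostic sketch rather than a public API; the value roughly corresponds to the effective -Xmx (i.e. the real spark.driver.memory), minus a little JVM overhead.

# Maximum heap the driver JVM was actually started with, in bytes.
runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
max_heap_bytes = runtime.maxMemory()
print("driver max heap: %.2f GiB" % (max_heap_bytes / float(1024 ** 3)))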

Thus the official Spark documentation (https://spark.apache.org/docs/2.4.5/configuration.html#application-properties) is right when it says you should set the driver memory through the --driver-memory command line option or in your default properties file.



Source: https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
