I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number
In PySpark you can still call the Scala getExecutorMemoryStatus API through PySpark's Py4J bridge:
sc._jsc.sc().getExecutorMemoryStatus().size()
I found that my sessions were sometimes killed by the remote side, which surfaced as a strange Java error:
Py4JJavaError: An error occurred while calling o349.defaultMinPartitions.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
I worked around this with the following:
def check_alive(spark_conn):
    """Check if connection is alive. ``True`` if alive, ``False`` if not"""
    try:
        # Any call through the Py4J bridge fails once the SparkContext is stopped.
        spark_conn._jsc.sc().getExecutorMemoryStatus()
        return True
    except Exception:
        return False

def get_number_of_executors(spark_conn):
    if not check_alive(spark_conn):
        raise Exception('Unexpected Error: Spark Session has been killed')
    try:
        return spark_conn._jsc.sc().getExecutorMemoryStatus().size()
    except Exception as exc:
        # Keep the original error attached instead of swallowing it.
        raise Exception('Unknown error') from exc
The other answers provide a way to get the number of executors. Here is a way to get the number of nodes. This includes head and worker nodes.
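# Each key has the form "host:port".
s = sc._jsc.sc().getExecutorMemoryStatus().keys()
# Strip the Scala "Set(...)" wrapper from the printed form and split the entries.
l = str(s).replace("Set(","").replace(")","").split(", ")
d = set()
for i in l:
    # Keep only the host part so several entries on one node count once.
    d.add(i.split(":")[0])
len(d)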
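The string handling above depends on how the Scala collection happens to print itself. As a minimal sketch of just the counting step, assuming the keys have already been collected into a plain Python list of "host:port" strings (the helper name count_distinct_hosts and the sample addresses are hypothetical, not part of any Spark API), it could look like this:

```python
def count_distinct_hosts(endpoints):
    """Count unique hosts in an iterable of 'host:port' strings."""
    hosts = set()
    for endpoint in endpoints:
        endpoint = endpoint.strip()
        # Split on the last colon only, so the port is dropped even if
        # the host part itself contains colons.
        host, sep, _port = endpoint.rpartition(":")
        # If there is no colon at all, treat the whole string as the host.
        hosts.add(host if sep else endpoint)
    return len(hosts)


# Hypothetical sample: two entries on one host, one entry on another.
sample = ["10.0.0.1:40001", "10.0.0.1:40002", "10.0.0.2:40001"]
print(count_distinct_hosts(sample))  # 2
```

How the keys are pulled out of the JVM object is left out on purpose; this only shows the de-duplication by host.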
sc.defaultParallelism is just a hint. Depending on the configuration, it may have no relation to the number of nodes. It is the number of partitions used when you call an operation that takes a partition count argument but you don't provide one. For example, sc.parallelize makes a new RDD from a list. You can tell it how many partitions to create in the RDD with the second argument, but the default value for this argument is sc.defaultParallelism.
You can get the number of executors with sc.getExecutorMemoryStatus in the Scala API, but this is not exposed in the Python API.
In general the recommendation is to have around 4 times as many partitions in an RDD as you have executors. This is a good tip, because if there is variance in how much time the tasks take this will even it out. Some executors will process 5 faster tasks while others process 3 slower tasks, for example.
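To illustrate why finer partitioning evens out uneven task times, here is a toy simulation in plain Python. It is only a sketch: it assumes each task goes to whichever executor becomes free first, which is a simplification and not Spark's actual scheduler, and the function name makespan and all durations are made up for the example.

```python
import heapq


def makespan(task_durations, num_executors):
    """Finish time when each task goes to the executor that frees up first."""
    # One entry per executor: the time at which it becomes free.
    free_at = [0] * num_executors
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# 30 units of work on 2 executors, split into 2 uneven partitions:
# one executor sits idle while the other finishes the larger one.
print(makespan([18, 12], 2))  # 18

# The same 30 units split into 8 partitions (3 slow, 5 fast):
# one executor ends up with the 3 slow tasks, the other with the 5 fast ones.
print(makespan([5, 3, 3, 5, 3, 3, 5, 3], 2))  # 15
```

With more, smaller tasks the executors finish at about the same time even though the individual task durations differ.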
You don't need to be very accurate with the partition count. If you have a rough idea, you can go with an estimate. For example, if you know you have fewer than 200 CPUs, 500 partitions will be fine.
So try to create RDDs with this number of partitions:
rdd = sc.parallelize(data, 500) # If distributing local data.
rdd = sc.textFile('file.csv', 500) # If loading data from a file.
Or repartition the RDD before the computation if you don't control the creation of the RDD:
rdd = rdd.repartition(500)
You can check the number of partitions in an RDD with rdd.getNumPartitions().
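If you do have an executor count, the rule of thumb from this answer (roughly 4 partitions per executor) is simple arithmetic. A minimal sketch, where the helper name suggested_partitions and its default factor are assumptions taken from the recommendation above rather than anything provided by Spark:

```python
def suggested_partitions(num_executors, partitions_per_executor=4):
    """Rule-of-thumb partition count: a small multiple of the executor count."""
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return num_executors * partitions_per_executor


print(suggested_partitions(50))  # 200
```

The result can be passed as the partition count to sc.parallelize, sc.textFile, or rdd.repartition. Since an estimate is good enough, rounding it up to a convenient number is fine.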
It should be possible to get the number of nodes in the cluster using this (similar to @Dan's method above, but shorter and works better!).
sc._jsc.sc().getExecutorMemoryStatus().keySet().size()