Getting the number of visible nodes in PySpark

前端未结


I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number


5 Answers

自闭症患者

2020-12-24 08:14
In PySpark you can still call the Scala getExecutorMemoryStatus API through PySpark's Py4J bridge:
```
sc._jsc.sc().getExecutorMemoryStatus().size()
```
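A minimal sketch of how that call might be wrapped, in case you want executors rather than map entries. It assumes the status map holds one entry per block manager, including the driver's own, so one entry is subtracted. The helper name count_executors is hypothetical, and the driver assumption is worth checking against your own cluster.

```python
def count_executors(sc):
    """Estimate the number of executors visible to the given SparkContext.

    Assumption: the status map holds one entry per block manager, including
    the driver, so one entry is subtracted. `count_executors` is a
    hypothetical helper name, not part of PySpark.
    """
    # Reach the Scala SparkContext through the Py4J gateway.
    status = sc._jsc.sc().getExecutorMemoryStatus()
    # Never report a negative count if the map is unexpectedly empty.
    return max(0, status.size() - 1)
```

Under that assumption, a local-mode session with only the driver present would report 0 executors.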

暖寄归人

2020-12-24 08:16

I found that my sessions were sometimes killed by the remote side, which produced a strange Java error:

```
Py4JJavaError: An error occurred while calling o349.defaultMinPartitions.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
```

I avoided this with the following:
```
def check_alive(spark_conn):
    """Return ``True`` if the Spark connection is alive, ``False`` otherwise."""
    try:
        # This call raises if the underlying SparkContext has been stopped.
        spark_conn._jsc.sc().getExecutorMemoryStatus()
        return True
    except Exception:
        return False


def get_number_of_executors(spark_conn):
    """Return the number of entries in the executor memory status map."""
    if not check_alive(spark_conn):
        raise Exception('Unexpected Error: Spark Session has been killed')
    try:
        return spark_conn._jsc.sc().getExecutorMemoryStatus().size()
    except Exception as exc:
        # Keep the original failure attached instead of swallowing it.
        raise Exception('Unknown error') from exc
```


南旧

2020-12-24 08:19
The other answers provide a way to get the number of executors. Here is a way to get the number of nodes. This includes head and worker nodes.
```
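# The keys are "host:port" strings; str() is expected to render them as "Set(a:1, b:2)".
keys = sc._jsc.sc().getExecutorMemoryStatus().keys()
entries = str(keys).replace("Set(", "").replace(")", "").split(", ")

# Drop the port so that several entries on the same host count once.
hosts = set()
for entry in entries:
    hosts.add(entry.split(":")[0])
len(hosts)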
```
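The string handling above can be pulled into a small pure function, which makes it easy to check without a cluster. This is only a sketch: hosts_from_set_repr is a hypothetical name, and it assumes the keys print in the form "Set(host:port, host:port)" as in the snippet above. It splits on the last colon rather than the first, which is a small deviation from the original.

```python
def hosts_from_set_repr(set_repr):
    """Extract distinct host names from a string like "Set(h1:4040, h2:5050)".

    Assumption: each entry has the form "host:port". The function name is
    hypothetical and not part of PySpark.
    """
    inner = set_repr.strip()
    # Remove the surrounding "Set(" and ")" if they are present.
    if inner.startswith("Set(") and inner.endswith(")"):
        inner = inner[len("Set("):-1]
    hosts = set()
    for entry in inner.split(","):
        entry = entry.strip()
        if entry:
            # Keep everything before the last colon, i.e. drop the port.
            hosts.add(entry.rsplit(":", 1)[0])
    return hosts
```

With a live context it would be called as len(hosts_from_set_repr(str(sc._jsc.sc().getExecutorMemoryStatus().keys()))).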
时光说笑

2020-12-24 08:21
sc.defaultParallelism is just a hint. Depending on the configuration, it may have no relation to the number of nodes. It is the number of partitions used when an operation takes a partition count argument and you don't provide one. For example, sc.parallelize makes a new RDD from a list. You can tell it how many partitions to create in the RDD with the second argument, but the default value for that argument is sc.defaultParallelism.

You can get the number of executors with sc.getExecutorMemoryStatus in the Scala API, but this is not exposed in the Python API.

In general the recommendation is to have around 4 times as many partitions in an RDD as you have executors. This is a good tip, because it evens out any variance in how long the tasks take. Some executors will process 5 faster tasks while others process 3 slower tasks, for example.
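To make the evening-out effect concrete, here is a toy scheduling model, not Spark's actual scheduler. It assumes each task goes to whichever executor frees up first, and that splitting a partition into four splits its run time into four equal parts. The function name makespan and the durations are made up for illustration.

```python
import heapq


def makespan(task_durations, num_executors):
    """Return the finish time when each task goes to the first free executor.

    Toy model only: `makespan` is a hypothetical helper, and real Spark
    scheduling also depends on locality, cores per executor, and more.
    """
    # Each heap entry is the time at which one executor becomes free.
    free_at = [0.0] * num_executors
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# One partition per executor: the single slow task dominates.
coarse = makespan([1.0, 1.0, 1.0, 5.0], num_executors=4)

# Four partitions per executor: the slow work is spread across executors.
fine = makespan([0.25] * 12 + [1.25] * 4, num_executors=4)
```

In this model the total work is the same in both cases, but coarse comes out larger than fine because one executor is no longer stuck with the whole slow partition.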

You don't need to be very accurate with this. If you have a rough idea, you can go with an estimate. For example, if you know you have fewer than 200 CPUs, 500 partitions will be fine.

So try to create RDDs with this number of partitions:
```
rdd = sc.parallelize(data, 500)     # If distributing local data.
rdd = sc.textFile('file.csv', 500)  # If loading data from a file.
```
Or repartition the RDD before the computation if you don't control the creation of the RDD:
```
rdd = rdd.repartition(500)
```
You can check the number of partitions in an RDD with rdd.getNumPartitions().
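If you do know the executor count, the 4x rule of thumb above can be written as a tiny helper. This is only a sketch: suggest_partitions and its factor parameter are hypothetical names, and 4 is the multiplier from the recommendation above, not a Spark setting.

```python
def suggest_partitions(num_executors, factor=4):
    """Return a rough partition count: `factor` partitions per executor.

    `suggest_partitions` and `factor` are hypothetical names; the default
    of 4 is a rule of thumb, not a Spark configuration value.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return factor * num_executors
```

It could then be used as rdd.repartition(suggest_partitions(50)), which asks for 200 partitions.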
温柔的废话

2020-12-24 08:29
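It should be possible to get the number of nodes in the cluster using this (similar to @Dan's method above, but shorter). Note that the keys are host:port entries, so if a node runs more than one executor this counts entries rather than distinct hosts.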
```
sc._jsc.sc().getExecutorMemoryStatus().keySet().size()
```