PySpark application only partly exploits Dataproc cluster resources

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-12 12:14:52

Question


My PySpark application runs a UDF over a 106.36 MB dataset (817,270 records), which takes about 100 hours with regular Python lambda functions. I have spawned a Google Dataproc cluster with 20 worker nodes with 8 vCPUs each. However, upon execution only 3 nodes and 3 vCPUs in total are used. Obviously, I would like the cluster to use all the resources that I make available.

The default number of partitions of my resulting dataframe is 8. I tried repartitioning it to 100, but the cluster keeps using only 3 nodes and 3 vCPUs. Also, when I run a command to check the number of executors that Spark sees, the result is only 3.

This is the PySpark code that gets executed:

from pyspark.sql.types import StringType, MapType
from pyspark.sql.functions import udf

# Convert the existing customer_names data to a Spark DataFrame
customer_names = spark.createDataFrame(customer_names)

# Wrap the (externally defined) embargoMatch function in a UDF that returns a map of strings
embargo_match_udf = udf(lambda x, y: embargoMatch(x, y), MapType(StringType(), StringType()))
customer_names = customer_names.withColumn('JaroDistance', embargo_match_udf('name', 'customer_code'))

# Pull the 'max_jaro' entry out of the returned map into its own column
result = customer_names.withColumn('jaro_similarity', customer_names.JaroDistance['max_jaro'])

result.write.format("com.databricks.spark.csv").save('gs://charles-embargo-bucket/sparkytuesday')


Here is some Spark output as seen from my Jupyter notebook:

print(sc) -> <SparkContext master=yarn appName=PySparkShell>
print(result.rdd.getNumPartitions()) -> 8
result = result.repartition(100)
print(result.rdd.getNumPartitions()) -> 100
sc._jsc.sc().getExecutorMemoryStatus().size() -> 3
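
A quick way to double-check which executor settings the running context actually picked up (a minimal sketch; the keys below are standard Spark configuration properties) is to read them from the SparkConf:

# Inspect the executor-related settings of the running SparkContext
conf = sc.getConf()
print(conf.get('spark.executor.instances', 'not set'))
print(conf.get('spark.executor.cores', 'not set'))
print(conf.get('spark.dynamicAllocation.enabled', 'not set'))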

Answer 1:


For those interested in how I solved the issue:

By default, my Spark context assumed two worker nodes, no matter how many extra nodes I spawned in Google Cloud's Dataproc UI. Therefore, I manually reconfigured the Spark context as follows:

from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.conf import SparkConf

# Stop the context the notebook created by default, set the desired executor
# configuration as system properties, then start a fresh context that picks them up
sc.stop()
SparkContext.setSystemProperty('spark.executor.cores', '4')
SparkContext.setSystemProperty('spark.executor.instances', '5')
sc = SparkContext("yarn", "embargotest")
spark = SparkSession.builder.appName('embargotest').getOrCreate()

In addition, I explicitly repartitioned the customer_names dataset into 20 partitions (4 cores x 5 instances) before applying the .withColumn function to this dataframe.

customer_names = spark.createDataFrame(customer_names).repartition(20)
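
As a side note, a minimal sketch of an alternative (assuming no SparkContext is already running, otherwise these settings are ignored) is to pass the same properties through the SparkSession builder instead of stopping and recreating the context by hand; the property names are standard Spark configuration keys and the app name is only an example:

from pyspark.sql import SparkSession

# Sketch: request the executor configuration when building the session
spark = (SparkSession.builder
         .appName('embargotest')                    # example app name
         .config('spark.executor.cores', '4')       # cores per executor
         .config('spark.executor.instances', '5')   # number of executors
         .getOrCreate())
sc = spark.sparkContext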

Hope this can help someone with a similar issue!




Answer 2:


Additionally, you may want to try the following to have PySpark dynamically adjust the number of executors in your application via Dynamic Allocation:

SparkContext.setSystemProperty("spark.dynamicAllocation.enabled", "true")
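
Dynamic allocation usually needs a few companion settings as well; here is a minimal sketch, assuming the external shuffle service is enabled on the cluster (Dataproc normally enables it by default) and using example min/max executor counts:

from pyspark import SparkContext

# Sketch: set dynamic allocation properties before creating the SparkContext
# (in a notebook, stop the running context first with sc.stop() as in Answer 1)
SparkContext.setSystemProperty("spark.dynamicAllocation.enabled", "true")
SparkContext.setSystemProperty("spark.shuffle.service.enabled", "true")        # required by dynamic allocation
SparkContext.setSystemProperty("spark.dynamicAllocation.minExecutors", "1")    # example lower bound
SparkContext.setSystemProperty("spark.dynamicAllocation.maxExecutors", "20")   # example upper bound
sc = SparkContext("yarn", "embargotest")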


Source: https://stackoverflow.com/questions/57161493/pyspark-application-only-partly-exploits-dataproc-cluster-resources
