Question
As the final stage of a PySpark job, I need to save 33 GB of data to Cloud Storage.
My cluster is on Dataproc and consists of 15 n1-standard-4 workers. I'm writing Avro, and this is the code I use to save the data:
df = spark.createDataFrame(df.rdd, avro_schema_str)
df \
    .write \
    .format("avro") \
    .partitionBy('<field_with_<5_unique_values>', '<field_with_lots_of_unique_values>') \
    .save(f"gs://{output_path}")
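For context on the file counts involved: with Hive-style `partitionBy`, each write task emits one file per distinct combination of partition values it holds, so a high-cardinality partition column can multiply the number of objects written to GCS. A rough back-of-the-envelope in plain Python (the task count and cardinality below are hypothetical, not taken from this job):

```python
# Worst-case output-file count for a partitioned write:
# each of the T write tasks may hold rows for every partition value,
# producing up to T * (values of col1) * (values of col2) files.
num_tasks = 200        # hypothetical number of write tasks
low_card = 5           # "<5 unique values", as in the question
high_card = 10_000     # hypothetical cardinality of the second column

worst_case_files = num_tasks * low_card * high_card
print(worst_case_files)  # up to 10,000,000 small objects in the worst case
```

Even if the actual count is far below the worst case, millions of small-object creations against GCS dominate the wall-clock time of the write stage; repartitioning by the partition columns before the write caps the file count at roughly one file per partition combination.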
The write stage stats from the UI:
My worker stats:
Strangely, despite a reasonable partition size, the shuffle spill is huge:
The questions I want to ask are the following:
1. If the stage takes 1.3 h but the workers are busy for only 25 min, does that imply the driver spends the remaining ~50 min writing to GCS?
2. What causes a shuffle spill of this size, given that no caching or persisting was used?
3. Why does the stage take so long?
UPDATE:
The SQL tab:
Source: https://stackoverflow.com/questions/65349126/spark-unusually-slow-data-write-to-cloud-storage