Question
As the final stage of a PySpark job, I need to save 33 GB of data to Cloud Storage.
My cluster is on Dataproc and consists of 15 n1-standard-4 workers. I'm writing Avro, and this is the code I use to save the data:
df = spark.createDataFrame(df.rdd, avro_schema_str)
df \
    .write \
    .format("avro") \
    .partitionBy('<field_with_<5_unique_values>', '<field_with_lots_of_unique_values>') \
    .save(f"gs://{output_path}")
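For context on the file counts involved: with Hive-style `partitionBy`, each write task emits one file per distinct combination of partition values it holds, so a high-cardinality partition column can multiply the number of objects written to GCS. A rough back-of-the-envelope in plain Python (the task count and cardinality below are hypothetical, not taken from this job):

```python
# Worst-case output-file count for a partitioned write:
# each of the T write tasks may hold rows for every partition value,
# producing up to T * (values of col1) * (values of col2) files.
num_tasks = 200        # hypothetical number of write tasks
low_card = 5           # "<5 unique values", as in the question
high_card = 10_000     # hypothetical cardinality of the second column

worst_case_files = num_tasks * low_card * high_card
print(worst_case_files)  # up to 10,000,000 small objects in the worst case
```

Even if the actual count is far below the worst case, millions of small-object creations against GCS dominate the wall-clock time of the write stage; repartitioning by the partition columns before the write caps the file count at roughly one file per partition combination.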
The write stage stats from the UI:
My worker stats:
Strangely, despite a reasonable partition size, the shuffle spill is huge:
The questions I want to ask are the following:
1. If the stage takes 1.3 h but the workers are busy for only 25 min, does that imply the driver spends the remaining ~50 min writing to GCS?
2. What causes a shuffle spill of this size, given that no caching or persisting was used?
3. Why does the stage take so long?
UPDATE:
The SQL tab:
Source: https://stackoverflow.com/questions/65349126/spark-unusually-slow-data-write-to-cloud-storage