Spark: unusually slow data write to Cloud Storage

Submitted by 大憨熊 on 2021-01-07 01:24:25

Question


As the final stage of the PySpark job, I need to save 33 GB of data to Cloud Storage.

My cluster runs on Dataproc and consists of 15 n1-standard-4 workers. I'm working with Avro, and this is the code I use to save the data:

# Re-apply the target schema, then write Avro partitioned by the two fields
df = spark.createDataFrame(df.rdd, avro_schema_str)
df \
   .write \
   .format("avro") \
   .partitionBy('<field_with_under_5_unique_values>', '<field_with_lots_of_unique_values>') \
   .save(f"gs://{output_path}")

The write stage stats from the UI: [screenshot]

My worker stats: [screenshot]

Strangely, given the adequate partition size, the shuffle spill is huge: [screenshot]

The questions I want to ask are the following:

  1. If the stage takes 1.3 h and the workers do their work for only 25 min, does that imply the driver spends the remaining ~50 min writing to GCS?

  2. What causes a shuffle spill of this size, given that no caching or persisting was used? (See the sketch after this list.)

  3. Why does the stage take so long?
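
Regarding question 2, one common mitigation is worth sketching (hedged, not from the original post): repartitioning on the write keys before the partitioned write, at the cost of an extra shuffle, so that each task holds rows for only a few output partitions and the per-task sort has less to spill. Column placeholders reuse the snippet above.

# Sketch only: align the in-memory partitioning with the on-disk layout so
# each task writes to few partition directories and spills less while sorting.
df \
   .repartition('<field_with_under_5_unique_values>', '<field_with_lots_of_unique_values>') \
   .write \
   .format("avro") \
   .partitionBy('<field_with_under_5_unique_values>', '<field_with_lots_of_unique_values>') \
   .save(f"gs://{output_path}")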

UPDATE:

The SQL tab: [screenshot]

Source: https://stackoverflow.com/questions/65349126/spark-unusually-slow-data-write-to-cloud-storage
