Spark - Read and Write back to same S3 location


Question


I am reading datasets dataset1 and dataset2 from S3 locations. I then transform them and write the result back to the same location that dataset2 was read from.

However, I get the error message below:

An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet

If I write to a new S3 location instead, e.g. s3://dataset_new_path.../, then the code works fine.

my_df \
  .write.mode('overwrite') \
  .format('parquet') \
  .save(s3_target_location)

Note: I have tried using .cache() after reading in the DataFrame, but I still get the same error.


Answer 1:


The problem is that you are reading from and writing to the same path that you are trying to overwrite. This is a standard Spark issue and has nothing to do with AWS Glue.

Spark evaluates transformations on a DataFrame lazily; they are only executed when an action is called. Spark builds a DAG that records all the transformations to be applied to the DataFrame.

When you read data from a location and write back to it with overwrite mode, the write is the action that triggers execution. As part of the overwrite, Spark's execution plan deletes the target path first and only then tries to read the source data from that path, which is by then empty; hence the error.
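
For illustration, here is a minimal sketch of the failing pattern (the bucket, path, and column names are hypothetical, not from the question):

# Sketch of the failing pattern: reading and overwriting the same S3 path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_path = "s3://my-bucket/dataset2/"            # hypothetical path

df = spark.read.parquet(source_path)                 # lazy: nothing is read yet
transformed = df.filter("some_column IS NOT NULL")   # lazy: only recorded in the DAG

# save() is the first action. With mode('overwrite'), Spark deletes source_path
# before the job reads from it, so the read finds no files and the job fails.
transformed.write.mode("overwrite").parquet(source_path)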

A possible workaround is to write to a temporary location first, then use that as the source to overwrite the dataset2 location, as in the sketch below.
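
A minimal sketch of this workaround, assuming hypothetical paths and a placeholder transformation:

# Workaround sketch: stage the result in a temporary location, then overwrite.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_path = "s3://my-bucket/dataset2/"       # original location of dataset2
temp_path = "s3://my-bucket/dataset2_tmp/"     # temporary staging location

# 1. Read, transform, and write the result to the temporary location.
df = spark.read.parquet(source_path)
transformed = df.filter("some_column IS NOT NULL")
transformed.write.mode("overwrite").parquet(temp_path)

# 2. Read the staged data back and overwrite the original location.
#    The source of this write is temp_path, not source_path, so deleting
#    source_path during the overwrite no longer breaks the read.
spark.read.parquet(temp_path) \
    .write.mode("overwrite") \
    .parquet(source_path)

Note that this doubles the write work and leaves the staged copy behind, so you may want to clean up the temporary prefix afterwards.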



Source: https://stackoverflow.com/questions/58362511/spark-read-and-write-back-to-same-s3-location
