Duplicate partition columns on write s3

Submitted by 旧城冷巷雨未停 on 2020-06-12 15:39:12

Question


I'm processing data and writing it to s3 using the following code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# 'dynamic' overwrite mode replaces only the partitions present in the DataFrame
spark = SparkSession.builder.config('spark.sql.sources.partitionOverwriteMode', 'dynamic').getOrCreate()
df = spark.read.parquet('s3://<some bucket>/<some path>').filter(F.col('processing_hr') == <val>)
transformed_df = do_lots_of_transforms(df)

# here's the important bit on how I'm writing it out
transformed_df.write.mode('overwrite').partitionBy('processing_hr').parquet('s3://bucket_name/location')

Basically, I'm trying to overwrite partitions with what I have in the data frame, but leave the previously processed partitions in s3.

The write usually succeeds, but fails intermittently, and when it fails it does so silently. When I read the data back in from 's3://bucket_name/location', I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o85.parquet.
: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

    Partition column name list #0: processing_hr, processing_hr
    Partition column name list #1: processing_hr

For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

    s3://bucket_name/location/processing_hr=2019-09-19 13%3A00%3A00
    s3://bucket_name/location/processing_hr=2019-09-19 20%3A00%3A00
    s3://bucket_name/location/processing_hr=2019-09-19 12%3A00%3A00/processing_hr=2019-09-19 12%3A00%3A00

I'm kind of baffled as to how this could happen. How do I keep spark from duplicating my partition column?

I've tried looking around at docs, the spark jira, and still can't seem to find anything even remotely related to this.
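One way to narrow down which partitions are affected (a hypothetical diagnostic helper, not part of the original question) is to scan the object keys under the output prefix and flag any path in which the partition column name appears more than once. In practice the key list would come from an S3 listing (e.g. boto3's `list_objects_v2` paginator); here it is shown with literal keys:

```python
def find_nested_partition_keys(keys, column):
    """Return keys whose path repeats the given partition column,
    e.g. .../processing_hr=X/processing_hr=X/part-0.parquet."""
    prefix = column + '='
    bad = []
    for key in keys:
        # count how many path segments look like 'processing_hr=...'
        hits = sum(1 for seg in key.split('/') if seg.startswith(prefix))
        if hits > 1:
            bad.append(key)
    return bad

keys = [
    'location/processing_hr=2019-09-19 13:00:00/part-0.parquet',
    'location/processing_hr=2019-09-19 12:00:00/processing_hr=2019-09-19 12:00:00/part-0.parquet',
]
print(find_nested_partition_keys(keys, 'processing_hr'))
```

Deleting the nested duplicate directories this finds lets the data be read again, but it does not address the root cause described in the answer below.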


Answer 1:


This seems to be due to the S3 eventual consistency issue. If you are on EMR < 5.30, enable the EMRFS consistent view; otherwise, upgrade to EMR 5.30, where the latest EMRFS appears to have addressed this issue.
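For reference, the EMRFS consistent view mentioned in the answer is typically enabled at cluster creation via an `emrfs-site` configuration classification (a sketch based on the AWS EMR documentation; verify the property names for your EMR release):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "true"
    }
  }
]
```

Note that EMRFS consistent view uses a DynamoDB table to track S3 metadata, which adds a small cost, and it became unnecessary once S3 offered strong read-after-write consistency.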



Source: https://stackoverflow.com/questions/58060209/duplicate-partition-columns-on-write-s3
