How to avoid reading old files from S3 when appending new data?

陌清茗 2021-01-17 01:50

Every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

df.write.mode("append").parquet(...)
2 Answers
  • 2021-01-17 02:43

    Switch this over to dynamic partition overwrite mode by setting:

    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    

    Also, avoid the DirectParquetOutputCommitter; leave the committer configuration at its defaults instead. You will achieve better results in terms of speed and consistency with the EMRFS S3-optimized committer.
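
    A minimal configuration sketch of that approach, assuming a partitioned dataset; the partition column `dt`, the bucket, and the path are hypothetical stand-ins, and `df` is the DataFrame from the question:

    ```python
    from pyspark.sql import SparkSession

    # In dynamic mode, an overwrite replaces only the partitions that are
    # present in the incoming DataFrame; older partitions in S3 stay untouched.
    spark = (
        SparkSession.builder
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .getOrCreate()
    )

    # Note: this setting only takes effect with mode("overwrite") on a
    # partitioned write; mode("append") ignores it.
    df.write.mode("overwrite") \
        .partitionBy("dt") \
        .parquet("s3://my-bucket/events/")  # hypothetical bucket/path
    ```

    Because only the affected partitions are rewritten, each 2-hourly run touches just the new data instead of re-reading or clobbering the whole dataset.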

  • 2021-01-17 02:56

    I resolved this issue by writing the dataframe to the EMR cluster's HDFS and then using s3-dist-cp to upload the Parquet files to S3.
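
    A sketch of that two-step workflow on an EMR master node; the HDFS staging directory and the S3 destination are hypothetical:

    ```shell
    # 1. Have the Spark job write to the cluster's HDFS instead of S3, e.g.:
    #    df.write.mode("append").parquet("hdfs:///tmp/parquet-staging/")

    # 2. Copy the finished files to S3 in one pass, so readers never
    #    observe a half-written dataset.
    s3-dist-cp \
      --src hdfs:///tmp/parquet-staging/ \
      --dest s3://my-bucket/events/

    # 3. Clean up the staging directory before the next run.
    hdfs dfs -rm -r -f /tmp/parquet-staging/
    ```

    Staging in HDFS sidesteps the slow rename-based commit on S3 and keeps partially written files out of the destination prefix entirely.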
