Every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:
df.write.mode(\"appe
Switch this over to dynamic partition overwrite mode by setting:
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
Also, avoid the DirectParquetOutputCommitter; leave the committer configuration at its default instead, since you will get better write speed from the EMRFS S3-optimized committer.
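If you are on EMR and want to confirm the S3-optimized committer is in effect, the property below is the one documented by AWS (it defaults to true on EMR 5.20.0+ for Parquet writes); this is just a sketch for checking the value, not something you normally need to set:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tgz-to-parquet")
        # EMR-documented property; defaults to true on EMR 5.20.0+ for Parquet output
        .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
        .getOrCreate()
    )

    # Sanity check that the committer optimization flag made it into the session
    print(spark.conf.get(
        "spark.sql.parquet.fs.optimized.committer.optimization-enabled"
    ))

Note that per the AWS docs some write patterns fall back to the default committer depending on the EMR release, so check the documentation for your release to confirm it applies to your job.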