Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled) and we realized the Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x

2 Answers

没有蜡笔的小新

2021-01-31 11:04

I think the S3 committer from Netflix is already open sourced at: https://github.com/rdblue/s3committer.

悲哀的现实

2021-01-31 11:24

You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Since you are on EMR, just use s3:// paths (no need for s3a).

We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as HDFS).

If you want to read more, check out the JIRA ticket SPARK-10063.
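
For anyone who wants to see the setting in context, here is a rough sketch of how it could be wired into a Spark 2.0 job. It is not taken from our actual job: the bucket path, table name, and app name are made up for illustration, and it assumes an EMR cluster where `s3://` paths resolve through EMRFS.

```scala
import org.apache.spark.sql.SparkSession

object CommitterV2Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("committer-v2-example") // hypothetical app name
      // Properties prefixed with "spark.hadoop." are copied into the Hadoop Configuration.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Same setting as in the answer, applied to an already-created context.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    // Small placeholder dataset; a real job would load its own data.
    val df = spark.range(0, 1000).toDF("id")

    df.write
      .mode("overwrite")
      .format("parquet")
      .option("path", "s3://my-bucket/tables/example") // hypothetical bucket and prefix
      .saveAsTable("example_table")                    // hypothetical table name

    spark.stop()
  }
}
```

One thing to keep in mind: with algorithm version 2, each task moves its output into the final directory when the task commits, instead of waiting for the job commit. That is where the speedup comes from, but it also means a job that fails midway can leave partial output behind.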
