Append new data to partitioned parquet files


I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV so I read them a


2 Answers

天命终不由人

2021-02-01 07:47
If your data is not partitioned by the output columns, the rows for each output partition are spread across all of your Spark partitions. That means every task will generate and write a file for each of your output partitions.

Consider repartitioning your data by your partition columns before writing, so that all the rows for a given output file sit in the same Spark partition:
```
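// Shuffle so that all rows sharing the same partition-column values
// end up in the same Spark partition before the write.
// repartition also accepts an optional partition count as its first argument.
data
  .filter(validPartnerIds($"partnerID"))
  .repartition($"partnerID", $"year", $"month", $"day")
  .write
  .partitionBy("partnerID", "year", "month", "day")
  .parquet(saveDestination)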
```
See: DataFrame.repartition
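
To make the file-count argument concrete, here is a small sketch in plain Scala (no Spark required). It is a simplified model, not Spark's actual writer: it assumes each task writes one file per distinct partition key it holds, and it stands in for a hash repartition with a simple `hashCode` bucket. The names `OutputFileCount`, `filesWritten`, and `repartitionByKey` are made up for this illustration.

```scala
object OutputFileCount {
  // A row is reduced to its partition key, e.g. "partnerID/year/month/day".
  type Key = String

  /** Assumed model: each task writes one file per distinct key it holds,
    * so the file count is the number of distinct (task, key) pairs. */
  def filesWritten(tasks: Seq[Seq[Key]]): Int =
    tasks.map(_.distinct.size).sum

  /** Simplified stand-in for a hash repartition on the key:
    * rows with equal keys always land in the same task. */
  def repartitionByKey(tasks: Seq[Seq[Key]], numTasks: Int): Seq[Seq[Key]] = {
    val grouped = tasks.flatten.groupBy(k => Math.floorMod(k.hashCode, numTasks))
    (0 until numTasks).map(i => grouped.getOrElse(i, Seq.empty[Key]))
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("p1/2021/02/01", "p2/2021/02/01", "p3/2021/02/01")
    // Four tasks, each holding rows for every key (the unpartitioned case).
    val unsorted = Seq.fill(4)(keys)
    val repartitioned = repartitionByKey(unsorted, numTasks = 4)
    println(s"files without repartition: ${filesWritten(unsorted)}")
    println(s"files after repartition:   ${filesWritten(repartitioned)}")
  }
}
```

With four tasks and three keys, the model gives 4 × 3 = 12 files without the repartition and 3 files with it, since each key then lives in exactly one task.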
无人及你

2021-02-01 08:03
If you need to append to existing files, you do have to use append mode. I don't know how many partitions you expect this to generate, but in my experience partitionBy causes a number of problems (memory and I/O issues alike) when there are many partitions.
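
For reference, a minimal sketch of what an append-mode write might look like. It assumes Spark 2.x or later on the classpath; the column names are borrowed from the other answer in this thread, and `AppendHourlyBatch`, `appendBatch`, the sample rows, and the temporary destination are all made up for illustration. On Databricks you would use the provided `spark` session rather than building and stopping a local one.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object AppendHourlyBatch {
  // Assumed partition columns, taken from the example elsewhere in this thread.
  val partitionCols: Seq[String] = Seq("partnerID", "year", "month", "day")

  /** Appends one batch to a partitioned Parquet dataset at `destination`. */
  def appendBatch(batch: DataFrame, destination: String): Unit =
    batch
      .repartition(partitionCols.map(col): _*) // group rows by output partition first
      .write
      .mode(SaveMode.Append)                   // add files, keep existing data
      .partitionBy(partitionCols: _*)
      .parquet(destination)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().appName("append-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val columns = Seq("partnerID", "year", "month", "day", "payload")
    val destination = java.nio.file.Files.createTempDirectory("parquet-append").toString

    // Two hypothetical hourly batches written one after the other.
    val hour1 = Seq(("p1", 2021, 2, 1, "a")).toDF(columns: _*)
    val hour2 = Seq(("p1", 2021, 2, 1, "b"), ("p2", 2021, 2, 1, "c")).toDF(columns: _*)
    appendBatch(hour1, destination)
    appendBatch(hour2, destination)

    // Reading the dataset back shows rows from both batches.
    println(spark.read.parquet(destination).count())
    spark.stop()
  }
}
```

Each call adds new files under the matching partition directories and leaves the files from earlier batches untouched.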

If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:

1) Use Snappy compression by adding this to the configuration:
```
conf.set("spark.sql.parquet.compression.codec", "snappy")
```
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
```
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
```
The metadata files are somewhat time-consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have had no issues.

If you generate many partitions (> 500), I'm afraid the best I can suggest is to look into a solution that does not use append mode. I simply never managed to get partitionBy to work with that many partitions.