I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV so I read them a
If you need to append the files, you definitely have to use append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory and IO issues alike).
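For reference, a minimal sketch of such an append-mode, partitioned write in a Databricks notebook; this assumes a SparkSession named spark (newer runtimes), and the paths and the date/hour partition columns are placeholders rather than anything taken from the question:
// Read one hour of CSV logs and append it to a Parquet layout partitioned by date and hour.
// The paths and column names below are assumptions for illustration only.
val logsDf = spark.read.option("header", "true").csv("/mnt/logs/raw/")
logsDf.write
  .mode("append")                // append this batch to the existing partitioned data
  .partitionBy("date", "hour")   // requires date and hour columns in the DataFrame
  .parquet("/mnt/logs/parquet")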
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy compression by adding it to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata files are somewhat time-consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have had no issues.
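For completeness, a hedged sketch of both settings applied together in a Databricks notebook, where sc is the predefined SparkContext; the spark.conf line assumes a Spark 2.x+ session and is just a runtime alternative to the conf.set call from point 1:
// Disable the Parquet summary metadata files (point 2).
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// Switch Parquet compression to snappy (point 1) - runtime variant for Spark 2.x+.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")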
If you generate many partitions (> 500), I'm afraid the best I can do is suggest that you look into a solution that does not use append mode - I simply never managed to get partitionBy to work with that many partitions.
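One possible workaround (my own sketch, with placeholder paths and values): skip partitionBy entirely and write each hourly batch straight into its target partition directory, computing the path yourself:
// Hypothetical workaround: write one hour's data directly into its partition path,
// avoiding an append-mode partitionBy over the whole dataset.
// date and hour are placeholder values you would derive from the file being processed.
val date = "2016-01-01"
val hour = "00"
logsDf.write
  .mode("overwrite")   // overwrites only this hour's directory
  .parquet(s"/mnt/logs/parquet/date=$date/hour=$hour")
// If the DataFrame still carries the date/hour columns, you may want to drop them first,
// since the values are already encoded in the directory path.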