Append new data to partitioned parquet files

Asked by 暗喜 on 2021-02-01 07:01 · 2 answers · 2001 views

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV, so I read them a…

2 Answers
  •  无人及你
    2021-02-01 08:03

    If you need to append to existing files, you definitely have to use append mode. I don't know how many partitions you expect it to generate, but I find that with many partitions, partitionBy causes a number of problems (both memory and I/O issues).
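
    A minimal sketch of such a write, assuming a DataFrame df and hypothetical partition columns year, month, day, and hour (the output path is also an assumption):

    // Sketch only: df, the partition columns, and the path are assumptions.
    df.write
      .mode("append")                              // add this hour's data to the existing dataset
      .partitionBy("year", "month", "day", "hour") // one directory level per partition column
      .parquet("/data/logs")                       // hypothetical output path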

    If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:

    1) Use Snappy compression by adding this to the configuration:

    conf.set("spark.sql.parquet.compression.codec", "snappy")
    

    2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:

    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    
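    For context, a sketch of where the two settings from 1) and 2) might sit in a job's setup; the app name and the explicit context creation are assumptions for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: 1) set the parquet codec on the SparkConf before the context is created...
    val conf = new SparkConf()
      .setAppName("hourly-log-etl") // hypothetical app name
      .set("spark.sql.parquet.compression.codec", "snappy")

    val sc = new SparkContext(conf)

    // ...then 2) switch off the _metadata / _common_metadata summary files.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
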

    The metadata files are somewhat time-consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.

    If you generate many partitions (> 500), I'm afraid the best I can do is suggest that you look into a solution that does not use append mode; I simply never managed to get partitionBy to work with that many partitions.
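
    As a sketch of one such alternative, assuming each hourly batch maps to a single, known partition: build the partition path yourself and append there directly, bypassing partitionBy entirely.

    // Sketch: the batch's partition values are assumed known; the layout mirrors
    // the year=/month=/day=/hour= directories that partitionBy would produce.
    val (year, month, day, hour) = ("2021", "02", "01", "07") // hypothetical batch
    val hourPath = s"/data/logs/year=$year/month=$month/day=$day/hour=$hour"
    df.write.mode("append").parquet(hourPath) // no partitionBy, so Spark touches only this directory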
