Updating values in an Apache Parquet file

醉话见心 2021-02-07 08:54

I have quite a hefty Parquet file in which I need to change the values of one column. One way to do this would be to update those values in source text files and recreate parqu

4 answers
  •  爱一瞬间的悲伤
    2021-02-07 09:18

    Let's start with the basics:

    Parquet is a file format that needs to be saved in a file system.

    Key questions:

    1. Does Parquet support append operations?
    2. Does the file system (namely, HDFS) allow append on files?
    3. Can the job framework (Spark) implement append operations?

    Answers:

    1. parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (I am not sure, but this could change in other implementations; the Parquet design itself does support append.)

    2. HDFS allows appending to files when the dfs.support.append property is enabled.

    3. Spark does not support appending to existing Parquet files, and there are no plans to add it; see this JIRA

    It is not a good idea to append to an existing file in a distributed system, especially since two writers might be working on it at the same time.

    More details are here:

    • http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/

    • http://bytepadding.com/linux/understanding-basics-of-filesystem/
