Updating values in an Apache Parquet file

醉话见心 2021-02-07 08:54

I have a quite hefty Parquet file in which I need to change the values of one of the columns. One way to do this would be to update those values in source text files and recreate parqu

4 answers

爱一瞬间的悲伤 (original poster)

2021-02-07 09:18
Let's start with the basics:

Parquet is a file format, so the data has to be stored as files in a file system.

Key questions:
1. Does Parquet support append operations?
2. Does the file system (namely, HDFS) allow append on files?
3. Can the job framework (Spark) implement append operations?
Answers:
1. parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (I am not sure, but this could change in other implementations, since the Parquet design itself does support append.)
2. HDFS allows appending to files via the dfs.support.append property.
3. The Spark framework does not support appending to existing Parquet files, and there are no plans to add it; see this JIRA.

Appending to an existing file is not a good idea in distributed systems, especially since two writers might be active at the same time.

More details are here:
- http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
- http://bytepadding.com/linux/understanding-basics-of-filesystem/