I have a quite hefty parquet file where I need to change values for one of the column. One way to do this would be to update those values in source text files and recreate parqu
There are workarounds, but you need to create your parquet file in a certain way to make it easier to update.
Best practices:
A. Use row groups to create parquet files. You need to optimize how many rows of data can go into a row group before features like data compression and dictionary encoding stop kicking in.
B. Scan row groups one at a time and figure out which row groups need to be updated. Generate new parquet files with amended data for each modified row group. It is more memory efficient to work with one row group's worth of data at a time instead of everything in the file.
C. Rebuild the original parquet file by appending unmodified row groups and with modified row groups generated by reading in one parquet file per row group.
it's surprisingly fast to reassemble a parquet file using row groups.
In theory it should be easy to append to existing parquet file if you just strip the footer (stats info), append new row groups and add new footer with update stats, but there isn't an API / Library that supports it..
You must re-create the file, this is the Hadoop way. Especially if the file is compressed.
Another approach, (very common in Big-data), is to do the update on another Parquet (or ORC) file, then JOIN / UNION at query time.
Look at this nice blog which can answer your question and provide a method to perform updates using Spark(Scala):
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Copy & Paste from the blog:
when we need to edit the data, in our data structures (Parquet), that are immutable.
You can add partitions to Parquet files, but you can’t edit the data in place.
But ultimately we can mutate the data, we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.
If you want to incrementally append the data in Parquet (you did n't ask this question, still it would be useful for other readers) Refer this well written blog:
http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html
Disclaimer: I have n't written those blogs, I just read it and found it might be useful for others.
Lets start with basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:
append
operations?append
on files? append
operations?Answers:
parquet.hadoop.ParquetFileWriter
only supports CREATE
and OVERWRITE
; there is no append
mode. (Not sure but this could potentially change in other implementations -- parquet design does support append
)
HDFS allows append
on files using the dfs.support.append
property
Spark framework does not support append
to existing parquet files, and with no plans to; see this JIRA
It is not a good idea to append to an existing file in distributed systems, especially given we might have two writers at the same time.
More details are here:
http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/
http://bytepadding.com/linux/understanding-basics-of-filesystem/