Question
We have generated a Parquet file with Dask (Python) and with Drill (from R, using the sergeant package). We have noticed a few issues:
- The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the Parquet file produced from R/Drill does not have these files and has .parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations?
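For reference, the two on-disk layouts described in the question look roughly like the sketch below. The part-file names are illustrative assumptions, not taken from the question; actual names depend on the writer (fastparquet typically emits part.N.parquet, and the hidden .crc files come from the Hadoop-style local filesystem checksumming used by Drill):

```
dask_output/                 drill_output/
├── _common_metadata         ├── 0_0_0.parquet
├── _metadata                └── .0_0_0.parquet.crc
├── part.0.parquet
└── part.1.parquet
```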
Answer 1:
(Only answering point 1; please post separate questions to make them easier to answer.)
_metadata and _common_metadata are helper files that are not required for a Parquet dataset; they are used by Spark/Dask/Hive/... to infer the metadata of all Parquet files of a dataset without the need to read the footer of every file. In contrast, Apache Drill generates a similar file in each folder (on demand) that contains all footers of all Parquet files. Only on the first query on a dataset are all files read; further queries will only read the file that caches all footers.
Tools using _metadata and _common_metadata should be able to leverage them for faster execution times, but should not depend on them to operate. If the files do not exist, the query engine simply needs to read all the footers itself.
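To make "reading the footer" concrete: every Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, so an engine without a cached metadata summary must open each part file and seek to its tail. The stdlib-only sketch below builds a dummy file with a Parquet-shaped trailer (the payload and footer bytes are fake stand-ins, not a valid Parquet file) and reads the footer length back:

```python
import os
import struct
import tempfile

def footer_length(path):
    """Read the Parquet trailer: ... | footer | 4-byte LE length | b'PAR1'."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)   # trailer lives in the last 8 bytes
        tail = f.read(8)
    if tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    return struct.unpack("<I", tail[:4])[0]

# Build a dummy file with a Parquet-shaped trailer (contents are fake).
fake_footer = b"\x00" * 42
with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as f:
    f.write(b"PAR1")                        # leading magic
    f.write(b"row group bytes...")          # stand-in for real data pages
    f.write(fake_footer)                    # stand-in for the Thrift footer
    f.write(struct.pack("<I", len(fake_footer)))
    f.write(b"PAR1")                        # trailing magic
    path = f.name

n = footer_length(path)
print(n)  # 42
os.remove(path)
```

This is exactly the per-file work that _metadata (or Drill's per-folder cache) lets an engine skip on subsequent queries.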
Source: https://stackoverflow.com/questions/45415829/generating-parquet-files-differences-between-r-and-python