Creating hive table using parquet file metadata

Backend · Unresolved · 6 answers · 1453 views
面向向阳花 · 2021-02-01 11:01

I wrote a DataFrame out as a Parquet file, and I would like to read it with Hive, using the metadata from the Parquet file itself.

Output from the Parquet write:

    _co
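For context, here is a minimal sketch of the setup being asked about, assuming Spark with Hive support; the table name and the single `id` column are made up for illustration. The DataFrame is written as Parquet, and the goal is to avoid hand-writing the column list in the Hive DDL, deriving it from the Parquet metadata instead.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Write a DataFrame out as Parquet (hypothetical single-column example)
    val df = spark.range(10).toDF("id")
    df.write.mode("overwrite").parquet("/home/gz_files/result")

    // What one would like to avoid: spelling out every column by hand
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS result (id BIGINT)
      STORED AS PARQUET
      LOCATION '/home/gz_files/result'
    """)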
6 Answers
  •  后悔当初
    2021-02-01 11:33

    I had the same question. It might be hard to implement from the practical side, though, as Parquet supports schema evolution:

    http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1

    For example, you can add a new column to your table without touching the data that is already in it; only new data files will carry the new metadata (still compatible with the previous version). A sketch of this is shown below.
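    As a hedged illustration of that point (the path and column names below are invented), Spark can append files with an extra column to an existing Parquet directory; the files already there are not rewritten, and only the new files carry the wider schema:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().getOrCreate()
        import spark.implicits._

        // Original files: schema (id)
        Seq(1, 2).toDF("id").write.parquet("/tmp/evolve")

        // Later files add a column: schema (id, label); old files stay as-is
        Seq((3, "a"), (4, "b")).toDF("id", "label")
          .write.mode("append").parquet("/tmp/evolve")

        // Reading the union of both schemas needs explicit schema merging
        val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolve")
        merged.printSchema() // id: int, label: string (null for the older rows)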

    Schema merging has been switched off by default since Spark 1.5.0, since it is a "relatively expensive operation" (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), so inferring the most recent schema may not be as simple as it sounds. Quick-and-dirty approaches are quite possible, though, e.g. by parsing the output of

    $ parquet-tools schema /home/gz_files/result/000000_0
    
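    Rather than parsing the parquet-tools text, a similar quick-and-dirty result can come from Spark itself, which already exposes the Parquet footer metadata as a StructType. This is only a sketch under that assumption; the table name is invented, and catalogString is assumed to produce Hive-compatible type names for the columns involved:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

        // Spark reads the schema out of the Parquet footers
        val schema = spark.read.parquet("/home/gz_files/result").schema

        // Map each field to Hive column syntax, e.g. `id` bigint
        val columns = schema.fields
          .map(f => s"`${f.name}` ${f.dataType.catalogString}")
          .mkString(",\n  ")

        val ddl =
          s"""CREATE EXTERNAL TABLE IF NOT EXISTS result (
             |  $columns
             |)
             |STORED AS PARQUET
             |LOCATION '/home/gz_files/result'""".stripMargin

        spark.sql(ddl)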
