I wrote a DataFrame out as a Parquet file, and I would like to read the file into Hive using the metadata from Parquet.
Output from the Parquet write:
_co
I had the same question. It might be hard to implement from a practical standpoint, though, since Parquet supports schema evolution:
http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1
For example, you could add a new column to your table without touching the data that's already in it. Only new data files will carry the new metadata (which stays compatible with the previous version), as sketched below.
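
To illustrate, here is a minimal Spark sketch of that kind of evolution. The paths and column names are made up, and it uses the modern SparkSession API; on Spark 1.x you would go through sqlContext instead:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()
import spark.implicits._

// First write: two columns.
Seq((1, "alice"), (2, "bob")).toDF("id", "name")
  .write.parquet("/tmp/evolving_table")

// Later write: one extra column, appended to the same directory.
// The files from the first write are left untouched.
Seq((3, "carol", 30.0)).toDF("id", "name", "amount")
  .write.mode("append").parquet("/tmp/evolving_table")

// Seeing the union of both schemas requires schema merging (see the next paragraph).
spark.read.option("mergeSchema", "true").parquet("/tmp/evolving_table").printSchema()
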
Schema merging has been switched off by default since Spark 1.5.0 because it is a "relatively expensive operation" (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), so inferring the most recent schema may not be as simple as it sounds. Quick-and-dirty approaches are quite possible, though, e.g. by parsing the output of
$ parquet-tools schema /home/gz_files/result/000000_0
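
As a rough sketch of that quick-and-dirty route, the snippet below turns saved parquet-tools output into a Hive CREATE TABLE statement. The type mapping is incomplete, and the exact parquet-tools output format, the table name, and the location are assumptions, not something from the original post:

import scala.io.Source

object ParquetSchemaToHive {
  // Minimal mapping from Parquet primitive types to Hive types (incomplete).
  val typeMap = Map(
    "int32"   -> "int",
    "int64"   -> "bigint",
    "int96"   -> "timestamp",
    "float"   -> "float",
    "double"  -> "double",
    "boolean" -> "boolean",
    "binary"  -> "string"   // assumes binary columns carry the UTF8 annotation
  )

  // Field lines look roughly like: "  optional binary name (UTF8);"
  val fieldPattern = """\s*(?:optional|required)\s+(\w+)\s+(\w+).*;""".r

  def main(args: Array[String]): Unit = {
    val schemaFile = args(0)  // parquet-tools schema output, saved to a file
    val columns = Source.fromFile(schemaFile).getLines().collect {
      case fieldPattern(pqType, name) if typeMap.contains(pqType) =>
        s"  `$name` ${typeMap(pqType)}"
    }.toList

    println(
      s"""CREATE EXTERNAL TABLE my_table (
         |${columns.mkString(",\n")}
         |)
         |STORED AS PARQUET
         |LOCATION '/home/gz_files/result';""".stripMargin)
  }
}

You would first redirect the parquet-tools output to a file (e.g. parquet-tools schema /home/gz_files/result/000000_0 > schema.txt) and then run the sketch against that file.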