I have a Hive table that is built on top of a load of external Parquet files. The Parquet files are generated by a Spark job, but because the metadata flag was set to false, the summary metadata files were not generated. Is there a way to read the metadata from the Parquet files themselves?
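For reference, a minimal sketch of a write that disables the summary metadata (the property name parquet.enable.summary-metadata, the DataFrame name df and the path are assumptions, not taken from the actual job):

// Hypothetical write-side setup: disable Parquet summary metadata before writing
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false") // assumed "metadata flag" referred to above

// df is the DataFrame produced by the job; /tmp/foo is an illustrative path
df.write.parquet("/tmp/foo")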
OK, so here is the drill: the metadata can be accessed directly using the Parquet API. You'll need to get the footers for your Parquet file first:
import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsScalaMapConverter}
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
val conf = spark.sparkContext.hadoopConfiguration
def getFooters(conf: Configuration, path: String) = {
  val fs = FileSystem.get(conf)
  val footers = ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  footers
}
Now you can get your file metadata as follows:
def getFileMetadata(conf: Configuration, path: String) = {
  getFooters(conf, path)
    .asScala.map(_.getParquetMetadata.getFileMetaData.getKeyValueMetaData.asScala)
}
and read the key-value metadata of your Parquet file:
getFileMetadata(conf, "/tmp/foo").headOption
// Option[scala.collection.mutable.Map[String,String]] =
// Some(Map(org.apache.spark.sql.parquet.row.metadata ->
// {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{"foo":"bar"}}
// {"name":"txt","type":"string","nullable":true,"metadata":{}}]}))
We can also use the extracted footers to write a standalone metadata file when needed:
import org.apache.parquet.hadoop.ParquetFileWriter
def createMetadata(conf: Configuration, path: String) = {
  val footers = getFooters(conf, path)
  ParquetFileWriter.writeMetadataFile(conf, new Path(path), footers)
}
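Calling it is straightforward (a sketch; /tmp/foo is just the illustrative path from above). writeMetadataFile writes the summary metadata next to the existing part files:

// Write the summary metadata alongside the existing Parquet part files.
// Afterwards the directory should contain a _metadata file (and, depending on
// the parquet-mr version, _common_metadata as well).
createMetadata(conf, "/tmp/foo")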
I hope this answers your question. You can read more about Spark DataFrames and metadata in awesome-spark's spark-gotchas repo.