Generate metadata for parquet files

借酒劲吻你 2021-02-05 13:44

I have a Hive table built on top of a large number of external Parquet files. The Parquet files should be generated by the Spark job, but because the metadata flag was set to false they

1 answer
  •  别跟我提以往
    2021-02-05 14:03

    OK, here is the drill: the metadata can be accessed directly with Parquet's own tooling (the `org.apache.parquet.hadoop` classes). First, you'll need to get the footers of your Parquet files:

    import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsScalaMapConverter}
    
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.conf.Configuration
    
    val conf = spark.sparkContext.hadoopConfiguration
    
    // Read the footers of the Parquet file(s) at `path`.
    def getFooters(conf: Configuration, path: String) = {
      val fs = FileSystem.get(conf)
      ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
    }
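
    As a quick sanity check on what the footers contain, the sketch below sums the row counts recorded in each footer. `rowCountsPerFile` is a hypothetical helper name, not part of the original answer; it assumes only the `Footer`, `ParquetMetadata`, and `BlockMetaData` accessors from parquet-hadoop:

    ```scala
    import scala.collection.JavaConverters._

    import org.apache.parquet.hadoop.Footer

    // Hypothetical helper: map each file path to the total number of rows
    // recorded in its row groups (blocks).
    def rowCountsPerFile(footers: java.util.List[Footer]): Map[String, Long] =
      footers.asScala.map { footer =>
        val rows = footer.getParquetMetadata.getBlocks.asScala.map(_.getRowCount).sum
        footer.getFile.toString -> rows
      }.toMap
    ```

    It could be called as `rowCountsPerFile(getFooters(conf, "/tmp/foo"))`.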
    

    Now you can get your file metadata as follows:

    def getFileMetadata(conf: Configuration, path: String) = {
      getFooters(conf, path)
        .asScala.map(_.getParquetMetadata.getFileMetaData.getKeyValueMetaData.asScala)
    }
    

    Now you can get the metadata of your Parquet file:

    getFileMetadata(conf, "/tmp/foo").headOption
    
    // Option[scala.collection.mutable.Map[String,String]] =
    //   Some(Map(org.apache.spark.sql.parquet.row.metadata ->
    //     {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{"foo":"bar"}},
    //     {"name":"txt","type":"string","nullable":true,"metadata":{}}]}))
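
    The value stored under `org.apache.spark.sql.parquet.row.metadata` is the Spark schema serialized as JSON, so it can be turned back into a `StructType`. The following is a minimal sketch, assuming Spark SQL is on the classpath; `sparkSchema` and `SparkRowMetadataKey` are hypothetical names introduced here for illustration:

    ```scala
    import org.apache.spark.sql.types.{DataType, StructType}

    // Key under which Spark stores the schema JSON, as seen in the output above.
    val SparkRowMetadataKey = "org.apache.spark.sql.parquet.row.metadata"

    // Hypothetical helper: parse the schema JSON, if present, into a StructType.
    def sparkSchema(metadata: scala.collection.Map[String, String]): Option[StructType] =
      metadata.get(SparkRowMetadataKey).map(DataType.fromJson).collect {
        case struct: StructType => struct
      }
    ```

    For example, `getFileMetadata(conf, "/tmp/foo").headOption.flatMap(sparkSchema).foreach(_.printTreeString())` should print the recovered schema, including the per-field metadata.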
    

    We can also use the extracted footers to write a standalone metadata file when needed:

    import org.apache.parquet.hadoop.ParquetFileWriter
    
    def createMetadata(conf: Configuration, path: String) = {
      val footers = getFooters(conf, path)
      ParquetFileWriter.writeMetadataFile(conf, new Path(path), footers)
    }
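
    To confirm that `createMetadata` did something, one option is to look for the summary files afterwards. The sketch below assumes `path` is the directory that holds the Parquet files and that the summary files use the names exposed as constants on `ParquetFileWriter` (`_metadata` and `_common_metadata`, as far as I know); `summaryFilesPresent` is a hypothetical helper:

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileWriter

    // Hypothetical helper: report which summary files exist under `dir`.
    def summaryFilesPresent(conf: Configuration, dir: String): Map[String, Boolean] = {
      val root = new Path(dir)
      val fs = root.getFileSystem(conf)
      Seq(
        ParquetFileWriter.PARQUET_METADATA_FILE,        // "_metadata"
        ParquetFileWriter.PARQUET_COMMON_METADATA_FILE  // "_common_metadata"
      ).map(name => name -> fs.exists(new Path(root, name))).toMap
    }
    ```

    Calling `summaryFilesPresent(conf, "/tmp/foo")` before and after `createMetadata(conf, "/tmp/foo")` shows whether the files were written.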
    

    I hope this answers your question. You can read more about Spark DataFrames and metadata in awesome-spark's spark-gotchas repo.
