Spark subquery scans the whole partition

Backend · Open · 2 answers · 930 views
小蘑菇 2021-01-13 11:08

I have a Hive table which is partitioned by a 'date' field. I want to write a query to get the data from the latest (max) partition.

spark.sql("select field fr         


        
2 Answers
  • 2021-01-13 11:31

    Building on Ram's answer, there is a much simpler way to accomplish this that eliminates a lot of overhead by querying the Hive metastore directly, rather than executing a Spark-SQL query. No need to reinvent the wheel:

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    import scala.collection.JavaConverters._

    // Build a HiveConf from the active SparkContext's Hadoop configuration
    val hiveConf = new HiveConf(spark.sparkContext.hadoopConfiguration, classOf[HiveConf])
    val cli = new HiveMetaStoreClient(hiveConf)

    // List all partitions (up to Short.MaxValue of them) and take the
    // lexicographic max of the partition-value strings
    val maxPart = cli
      .listPartitions("<db_name>", "<tbl_name>", Short.MaxValue)
      .asScala
      .map(_.getValues.asScala.mkString(","))
      .max
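
One caveat worth knowing: the partition values come back as strings, so .max above is a lexicographic comparison. For zero-padded ISO dates like 2021-01-13 that coincides with chronological order, but it breaks on unpadded values. A cluster-free sketch (the values are made up):

```scala
// Lexicographic max matches chronological max for zero-padded ISO dates
val partitionValues = Seq("2021-01-11", "2021-01-13", "2021-01-12")
val latest = partitionValues.max

// ...but not for unpadded values: '9' sorts after '1', so "9" beats "10"
val unpadded = Seq("9", "10").max
```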
    
  • 2021-01-13 11:34

    If I were you... I'd prefer a different approach rather than a SQL query and a full table scan. First run:

    spark.sql(s"show partitions $tablename")
    

    Then I would convert that into a Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]], which holds the Joda date values:

    import org.apache.spark.sql.SparkSession
    import org.joda.time.DateTime

    /**
      * listMyHivePartitions - lists Hive partitions as a sequence of maps
      * @param tableName String
      * @param spark     SparkSession
      * @return Seq[Map[String, DateTime]]
      */
    def listMyHivePartitions(tableName: String, spark: SparkSession): Seq[Map[String, DateTime]] = {
      println(s"Listing the keys from $tableName")
      val partitions: Seq[String] = spark.sql(s"show partitions $tableName").collect().map { row =>
        println(s" Identified key: ${row.toString()}")
        row.getString(0)
      }.toSeq
      println(s"Fetched ${partitions.size} partitions from $tableName")
      // Each partition spec looks like "date=2021-01-13/region=us"; split it
      // into key/value pairs and parse each value as a Joda DateTime
      partitions.map(key => key.split("/").toSeq.map { keyVal =>
        val keyValSplit = keyVal.split("=")
        (keyValSplit(0).toLowerCase().trim, new DateTime(keyValSplit(1).trim))
      }.toMap)
    }
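
The parsing step can be tried without a cluster. A minimal sketch, with java.time.LocalDate standing in for Joda's DateTime so it runs on a bare JVM, and made-up partition strings:

```scala
import java.time.LocalDate

// Simulated "show partitions" output: "key=value" specs, one per partition
val partitions = Seq("date=2021-01-11", "date=2021-01-13", "date=2021-01-12")

// Split each spec on "/" (multi-key partitions) and "=" (key vs value),
// then parse the value into a date
val parsed: Seq[Map[String, LocalDate]] = partitions.map { spec =>
  spec.split("/").toSeq.map { keyVal =>
    val Array(k, v) = keyVal.split("=")
    k.toLowerCase.trim -> LocalDate.parse(v.trim)
  }.toMap
}
```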
    

    Then apply getRecentPartitionDate, defined below:

    /**
      * getRecentPartitionDate - picks the map holding the most recent date
      * for the given column.
      *
      * @param column   String
      * @param seqOfMap Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]]
      */
    def getRecentPartitionDate(column: String, seqOfMap: Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]]): Option[Map[String, DateTime]] = {
      logger.info(" >>>>> column " + column)
      // Sort descending by the date in `column`; the head is the most recent
      val mapWithMostRecentBusinessDate = seqOfMap.sortWith { (a, b) =>
        logger.debug(a(column).toString() + " col2 " + b(column).toString())
        a(column).isAfter(b(column))
      }

      logger.debug(s" mapWithMostRecentBusinessDate: $mapWithMostRecentBusinessDate , \n Head = ${mapWithMostRecentBusinessDate.headOption} ")

      mapWithMostRecentBusinessDate.headOption
    }
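
Sorting the whole sequence just to take the head is O(n log n); maxBy finds the same answer in one pass. A cluster-free sketch (made-up data, java.time.LocalDate in place of Joda):

```scala
import java.time.LocalDate

val seqOfMap = Seq(
  Map("date" -> LocalDate.parse("2021-01-11")),
  Map("date" -> LocalDate.parse("2021-01-13")),
  Map("date" -> LocalDate.parse("2021-01-12"))
)

// One pass instead of a full sort; comparing via epoch day avoids needing
// an implicit Ordering for LocalDate
val mostRecent: Option[Map[String, LocalDate]] =
  if (seqOfMap.isEmpty) None
  else Some(seqOfMap.maxBy(_("date").toEpochDay))
```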
    

    The advantage: no SQL queries over the data and no full table scans, since show partitions touches only metadata.

    The same approach can also be applied when you fire show partitions for the table over a JDBC Statement against the metastore-backed database; the result of the query is a java.sql.ResultSet:

    import java.sql.{ResultSet, Statement}

    /**
      * showParts - runs "show partitions" for a table over a JDBC Statement.
      *
      * @param table
      * @param config
      * @param stmt
      */
    def showParts(table: String, config: Config, stmt: Statement): Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]] = {
      val showPartitionsCmd = "show partitions " + table
      logger.info("showPartitionsCmd " + showPartitionsCmd)
      try {
        val resultSet = stmt.executeQuery(showPartitionsCmd)
        val result = resultToSeq(resultSet)
        logger.info(s"partitions of $table -> $result")
        result
      } catch {
        case e: Exception =>
          logger.error(s"Exception occurred while showing partitions for table $table..", e)
          Seq.empty // return an empty sequence rather than null
      }
    }
    
    /**
      * resultToSeq - converts the ResultSet of "show partitions" into a
      * sequence of maps of partition key -> Joda DateTime.
      *
      * @param queryResult
      */
    def resultToSeq(queryResult: ResultSet): Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]] = {
      val md = queryResult.getMetaData

      val colNames = for (i <- 1 to md.getColumnCount) yield md.getColumnName(i)
      var rows = Seq[scala.collection.immutable.Map[String, org.joda.time.DateTime]]()
      while (queryResult.next()) {
        var row = scala.collection.immutable.Map.empty[String, DateTime]
        for (n <- colNames) {
          // Each cell looks like "date=2021-01-13"
          val str = queryResult.getString(n).split("=")
          row += str(0) -> DateTime.parse(str(1))
          logger.debug(row.toString())
        }
        rows = rows :+ row
      }

      rows
    }
    

    After getting the sequence of maps, apply getRecentPartitionDate defined at the top.
