Overwrite Hive partitions using Spark

北荒 2021-02-05 21:44

I am working with AWS and I have workflows that use Spark and Hive. My data is partitioned by date, so every day I have a new partition in my S3 storage. My problem is that when I overwrite a single day's partition, Spark deletes all the other existing partitions as well, instead of replacing only that day's data.
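
The failing write is presumably the usual pattern sketched below; the bucket, path, and DataFrame names are placeholders, not from the original post:

    import org.apache.spark.sql.SaveMode

    // With the default (static) overwrite behaviour in Spark < 2.3, this
    // deletes *all* existing date partitions under the target path before
    // writing, not just the partition for the new day.
    dailyDF.write
        .mode(SaveMode.Overwrite)
        .partitionBy("date")
        .parquet("s3://my-bucket/events")   // placeholder S3 path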

4 Answers
  •  心在旅途
    2021-02-05 22:26

    So, if you are using a Spark version < 2.3 and want to write into partitions dynamically without deleting the others, you can use the solution below.

    The idea is to register the DataFrame as a temporary view and then use spark.sql() to run the INSERT OVERWRITE query with Hive dynamic partitioning enabled.

    // Import needed to make the example self-contained
    import org.apache.spark.sql.SparkSession

    // Create SparkSession with Hive dynamic partitioning enabled
    val spark: SparkSession =
        SparkSession
            .builder()
            .appName("StatsAnalyzer")
            .enableHiveSupport()
            .config("hive.exec.dynamic.partition", "true")
            .config("hive.exec.dynamic.partition.mode", "nonstrict")
            .getOrCreate()

    // Register the DataFrame as a temporary view
    impressionsDF.createOrReplaceTempView("impressions_dataframe")

    // Create the output Hive table. A plain """ literal (no s interpolator)
    // is used so that \t and \n reach Hive as escape sequences rather than
    // being turned into literal tab/newline characters by Scala.
    // The LOCATION is a placeholder; point it at your own S3 path.
    spark.sql(
        """
          |CREATE EXTERNAL TABLE IF NOT EXISTS stats (
          |   ad            STRING,
          |   impressions   INT,
          |   clicks        INT
          |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
          |LOCATION 's3://my-bucket/stats'
        """.stripMargin
    )

    // Write the data as Hive partitions. With dynamic partitioning, only the
    // partitions produced by the query are overwritten; all other existing
    // partitions are left untouched. Note that day is selected unaggregated,
    // so it must also appear in GROUP BY.
    spark.sql(
        """
          |INSERT OVERWRITE TABLE stats
          |PARTITION(country = 'US', year = 2017, month = 3, day)
          |SELECT ad, SUM(impressions), SUM(clicks), day
          |FROM impressions_dataframe
          |GROUP BY ad, day
        """.stripMargin
    )
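
    Since Spark 2.3 there is also a built-in way to get the same behaviour:
    setting spark.sql.sources.partitionOverwriteMode to dynamic makes
    mode("overwrite") replace only the partitions present in the incoming
    data. A minimal sketch, assuming the stats table above already exists
    (resultDF is a hypothetical DataFrame with the same columns):

        // Spark >= 2.3: overwrite only the partitions contained in resultDF
        spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
        resultDF.write
            .mode("overwrite")
            .insertInto("stats")

    Note that insertInto matches columns by position rather than by name, so
    resultDF must put the data columns first and the partition columns last,
    in the table's declared order.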
    
