I am working with AWS and I have workflows that use Spark and Hive. My data is partitioned by date, so every day I have a new partition in my S3 storage. My problem is that when I overwrite one day's partition, the write deletes all the other existing partitions instead of replacing only the one that changed.
Adding to what wandermonk@ mentioned,
Dynamic Partition Inserts is only supported in SQL mode (for INSERT OVERWRITE TABLE SQL statements). Dynamic Partition Inserts is not supported for non-file-based data sources, i.e. InsertableRelations.
With Dynamic Partition Inserts, the behaviour of the OVERWRITE keyword is controlled by the spark.sql.sources.partitionOverwriteMode configuration property (default: static). The property controls whether Spark should delete all the partitions that match the partition specification, regardless of whether there is data to be written to them (static), or delete only the partitions that will have data written into them (dynamic).
When dynamic overwrite mode is enabled, Spark only deletes the partitions for which it has data to write. All other partitions remain intact.
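As a minimal sketch of the difference (assuming Spark >= 2.3, a hypothetical Hive table events partitioned by dt that already contains partitions dt=2019-01-01 and dt=2019-01-02, and a DataFrame newDay holding rows only for dt=2019-01-02; the names and dates are made up for illustration):

// Static mode (the default): the partition spec of an insertInto()
// matches every partition, so ALL existing partitions of `events`
// are dropped before writing -- dt=2019-01-01 is lost.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
newDay.write.mode("overwrite").insertInto("events")

// Dynamic mode: only dt=2019-01-02, the partition that actually
// receives data, is overwritten; dt=2019-01-01 stays intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
newDay.write.mode("overwrite").insertInto("events")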
From Writing Into Dynamic Partitions Using Spark (https://medium.com/a-muggles-pensieve/writing-into-dynamic-partitions-using-spark-2e2b818a007a):
Spark now writes data partitioned just as Hive would — which means only the partitions that are touched by the INSERT query get overwritten and the others are not touched.
I would suggest running the SQL through a SparkSession: you can run an "INSERT OVERWRITE ... PARTITION" query that selects the columns from the existing dataset, and this approach overwrites only the targeted partitions.
So, if you are using a Spark version < 2.3 and want to write into partitions dynamically without deleting the others, you can implement the solution below.
The idea is to register the dataset as a table and then use spark.sql() to run the INSERT query.
// Create a SparkSession with Hive dynamic partitioning enabled
val spark: SparkSession = SparkSession
  .builder()
  .appName("StatsAnalyzer")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

// Register the DataFrame as a temporary view
impressionsDF.createOrReplaceTempView("impressions_dataframe")

// Create the output Hive table
spark.sql(
  """
    |CREATE EXTERNAL TABLE stats (
    |  ad STRING,
    |  impressions INT,
    |  clicks INT
    |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  """.stripMargin
)

// Write the data into disk as Hive partitions.
// `day` is the dynamic partition column, so it is selected last
// and must also appear in the GROUP BY clause.
spark.sql(
  """
    |INSERT OVERWRITE TABLE stats
    |PARTITION(country = 'US', year = 2017, month = 3, day)
    |SELECT ad, SUM(impressions), SUM(clicks), day
    |FROM impressions_dataframe
    |GROUP BY ad, day
  """.stripMargin
)
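To sanity-check the result, you can list the table's partitions before and after the insert; only the (country='US', year=2017, month=3, day=...) partitions actually present in impressions_dataframe should change. SHOW PARTITIONS is plain Hive/Spark SQL:

// List the partitions currently registered for the table
spark.sql("SHOW PARTITIONS stats").show(truncate = false)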
If you are on Spark 2.3.0 or later, try setting spark.sql.sources.partitionOverwriteMode to dynamic. The dataset needs to be partitioned, and the write mode must be overwrite.
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
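Per the Spark SQL configuration docs, from Spark 2.4 onward the same mode can also be set per write operation through a DataFrameWriter option, which takes precedence over the session-wide setting (data and partitioned_table are the same placeholders as above):

// Per-write override: only the partitions receiving data are replaced,
// regardless of the session-level partitionOverwriteMode setting.
data.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .insertInto("partitioned_table")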