I am working with AWS and I have workflows that use Spark and Hive. My data is partitioned by date, so every day a new partition lands in my S3 storage. My problem is that when I write new data into one partition, all the other existing partitions get deleted.
So, if you are using a Spark version < 2.3 and want to write into partitions dynamically without deleting the other ones, you can use the workaround below. The idea is to register the DataFrame as a temporary view and then use spark.sql() to run the INSERT query.
// Create a SparkSession with Hive dynamic partitioning enabled
import org.apache.spark.sql.SparkSession

val spark: SparkSession =
  SparkSession
    .builder()
    .appName("StatsAnalyzer")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()

// Register the DataFrame as a temporary view
impressionsDF.createOrReplaceTempView("impressions_dataframe")

// Create the output Hive table (add a LOCATION clause if the data should live
// under a specific S3 prefix instead of the default warehouse directory)
spark.sql(
  """
    |CREATE EXTERNAL TABLE stats (
    |  ad STRING,
    |  impressions INT,
    |  clicks INT
    |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  """.stripMargin
)

// Write the data as Hive partitions; only the partitions produced by the
// query are overwritten, the rest of the table is left untouched
spark.sql(
  """
    |INSERT OVERWRITE TABLE stats
    |PARTITION(country = 'US', year = 2017, month = 3, day)
    |SELECT ad, SUM(impressions), SUM(clicks), day
    |FROM impressions_dataframe
    |GROUP BY ad, day
  """.stripMargin
)
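For completeness: Spark 2.3.0 introduced spark.sql.sources.partitionOverwriteMode. Setting it to dynamic makes an overwrite insertInto() replace only the partitions that actually appear in the DataFrame, so the SQL workaround above is no longer needed. A minimal sketch, assuming an aggregatedDF DataFrame (hypothetical name) whose columns match the stats table defined above, with the partition columns last:

// Spark >= 2.3: dynamic partition overwrite through the DataFrame API
import org.apache.spark.sql.SparkSession

val spark: SparkSession =
  SparkSession
    .builder()
    .appName("StatsAnalyzer")
    .enableHiveSupport()
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()

// aggregatedDF is assumed to already hold the aggregated stats;
// insertInto() matches columns by position, partition columns go last
aggregatedDF
  .select("ad", "impressions", "clicks", "country", "year", "month", "day")
  .write
  .mode("overwrite")
  .insertInto("stats")

Only the (country, year, month, day) combinations present in aggregatedDF are overwritten; every other partition in S3 stays as it is.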