spark partition data writing by timestamp

迷失自我 2021-02-08 14:56

I have some data with a timestamp column that is a long holding a standard epoch value, and I need to save that data partitioned as yyyy/MM/dd/HH using Spark with Scala.

2 Answers
  •  抹茶落季
    2021-02-08 15:17

    You can leverage Spark SQL's date/time functions for this. First, add a new date column derived from the unix-timestamp column:

    import org.apache.spark.sql.functions._

    // from_unixtime converts epoch seconds into a formatted timestamp string;
    // "yyyy" (calendar year) is correct here, not "YYYY" (week-based year)
    val withDateCol = data
      .withColumn("date_col", from_unixtime(col("timestamp"), "yyyy-MM-dd HH:mm:ss"))
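
    Note that from_unixtime expects seconds since the epoch. If your long column actually holds epoch milliseconds (a common convention; this is an assumption, not something stated in the question), scale it down first. A minimal sketch:

    // Hedged variant: only needed if "timestamp" holds epoch milliseconds
    val withDateColMs = data
      .withColumn("date_col", from_unixtime((col("timestamp") / 1000).cast("long")))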
    

    After this, you can add year, month, day and hour columns to the DataFrame, then partition by those new columns when writing:

    withDateCol
      .withColumn("year", year(col("date_col")))
      .withColumn("month", month(col("date_col")))
      .withColumn("day", dayofmonth(col("date_col")))
      .withColumn("hour", hour(col("date_col")))
      .drop("date_col")
      .write // partitionBy is a DataFrameWriter method, so .write must come first
      .partitionBy("year", "month", "day", "hour")
      .format("orc")
      .save("mypath")
    

    The columns included in the partitionBy clause won't be part of the file schema; their values are encoded in the directory names instead, and Spark restores them when the data is read back.
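
    As a quick check (a minimal sketch, assuming the "mypath" location from above and an active SparkSession named spark), reading the data back recovers the partition columns, and filters on them prune directories:

    // Partition discovery recreates year/month/day/hour from the directory names;
    // filtering on them scans only the matching subdirectories
    val restored = spark.read.format("orc").load("mypath")
      .where(col("year") === 2021 && col("hour") === 14)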
