I have some data with a timestamp column that is a long holding a standard epoch value. I need to save that data in a split format like yyyy/mm/dd/hh using Spark Scala.
You can leverage the Spark SQL date/time functions for this. First, add a new timestamp-type column created from the Unix timestamp column:
import org.apache.spark.sql.functions._

val withDateCol = data
  .withColumn("date_col", from_unixtime(col("timestamp")).cast("timestamp"))
After this, you can add year, month, day, and hour columns to the DataFrame and then partition by these new columns for the write:
withDateCol
  .withColumn("year", year(col("date_col")))
  .withColumn("month", month(col("date_col")))
  .withColumn("day", dayofmonth(col("date_col")))
  .withColumn("hour", hour(col("date_col")))
  .drop("date_col")
  .write
  .partitionBy("year", "month", "day", "hour")
  .format("orc")
  .save("mypath")
The columns included in the partitionBy clause won't be part of the file schema; their values are encoded in the directory structure instead (e.g. year=2019/month=1/day=5/hour=23).
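Because the partition values live only in the directory names, Spark reconstructs them when you read the output back; a quick check, assuming the same "mypath" output location used above:

val readBack = spark.read
  .format("orc")
  .load("mypath")

// year, month, day and hour reappear as columns inferred from the directory names
readBack.printSchema()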