spark partition data writing by timestamp

迷失自我 2021-02-08 14:56

I have some data with a timestamp column that is a long holding a standard epoch value, and I need to save that data partitioned as yyyy/MM/dd/HH using Spark with Scala.

2 Answers
  •  抹茶落季
    2021-02-08 15:17

    You can leverage Spark SQL's date/time functions for this. First, add a new date column derived from the unix-timestamp column:

    import org.apache.spark.sql.functions._

    // from_unixtime converts epoch seconds into a formatted timestamp string;
    // "yyyy" (calendar year) is correct here, not "YYYY" (week-based year)
    val withDateCol = data
      .withColumn("date_col", from_unixtime(col("timestamp"), "yyyy-MM-dd HH:mm:ss"))
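
    Note that from_unixtime expects seconds since the epoch. If your long column actually holds epoch milliseconds (a common convention; this is an assumption, not something stated in the question), scale it down first. A minimal sketch:

    // Hedged variant: only needed if "timestamp" holds epoch milliseconds
    val withDateColMs = data
      .withColumn("date_col", from_unixtime((col("timestamp") / 1000).cast("long")))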
    

    After this, you can add year, month, day and hour columns to the DataFrame, then partition by those new columns when writing:

    withDateCol
      .withColumn("year", year(col("date_col")))
      .withColumn("month", month(col("date_col")))
      .withColumn("day", dayofmonth(col("date_col")))
      .withColumn("hour", hour(col("date_col")))
      .drop("date_col")
      .write // partitionBy is a DataFrameWriter method, so .write must come first
      .partitionBy("year", "month", "day", "hour")
      .format("orc")
      .save("mypath")
    

    The columns included in the partitionBy clause won't be part of the file schema; their values are encoded in the directory names instead, and Spark restores them when the data is read back.
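
    As a quick check (a minimal sketch, assuming the "mypath" location from above and an active SparkSession named spark), reading the data back recovers the partition columns, and filters on them prune directories:

    // Partition discovery recreates year/month/day/hour from the directory names;
    // filtering on them scans only the matching subdirectories
    val restored = spark.read.format("orc").load("mypath")
      .where(col("year") === 2021 && col("hour") === 14)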
