问题
I have a dataframe
df of columns
("id", "current_date", "days")
and I am trying to add the the "days
" to "current_date
" and create a new dataframe
with new column
called "new_date
" using spark scala function date_add()
val newDF = df.withColumn("new_Date", date_add(df("current_date"), df("days").cast("Int")))
But looks like the function date_add
only accepts Int
values and not columns
. How can get the desired output in such case? Are there any alternative functions i can use to get the desired output?
spark version: 1.6.0 scala version: 2.10.6
回答1:
A small custom udf can be used to make this date arithmetic possible.
import org.apache.spark.sql.functions.udf
import java.util.concurrent.TimeUnit
import java.util.Date
import java.text.SimpleDateFormat
val date_add = udf((x: String, y: Int) => {
val sdf = new SimpleDateFormat("yyyy-MM-dd")
val result = new Date(sdf.parse(x).getTime() + TimeUnit.DAYS.toMillis(y))
sdf.format(result)
} )
Usage:
scala> val df = Seq((1, "2017-01-01", 10), (2, "2017-01-01", 20)).toDF("id", "current_date", "days")
df: org.apache.spark.sql.DataFrame = [id: int, current_date: string, days: int]
scala> df.withColumn("new_Date", date_add($"current_date", $"days")).show()
+---+------------+----+----------+
| id|current_date|days| new_Date|
+---+------------+----+----------+
| 1| 2017-01-01| 10|2017-01-11|
| 2| 2017-01-01| 20|2017-01-21|
+---+------------+----+----------+
回答2:
No need to use an UDF, you can do it using an SQL expression:
val newDF = df.withColumn("new_date", expr("date_add(current_date,days)"))
来源:https://stackoverflow.com/questions/44361332/add-number-of-days-column-to-date-column-in-same-dataframe-for-spark-scala-app