I have the following dataframe:
corr_temp_df
[('vacationdate', 'date'),
('valueE', 'string'),
('valueD', 'string'),
('valueC', 'string'),
Let's create some dummy data:
import datetime
from pyspark.sql import Row
from pyspark.sql.functions import col
row = Row("vacationdate")
df = sc.parallelize([
    row(datetime.date(2015, 10, 7)),
    row(datetime.date(1971, 1, 1))
]).toDF()
If you use Spark >= 1.5.0 you can use the date_format function:
from pyspark.sql.functions import date_format
(df
    .select(date_format(col("vacationdate"), "dd-MM-yyyy")
    .alias("date_string"))
    .show())
In Spark < 1.5.0 it can be done using a Hive UDF:
df.registerTempTable("df")
sqlContext.sql(
    "SELECT date_format(vacationdate, 'dd-MM-yyyy') AS date_string FROM df")
It is of course still available in Spark >= 1.5.0.
If you don't use HiveContext you can mimic date_format using a UDF:
from pyspark.sql.functions import udf, lit
my_date_format = udf(lambda d, fmt: d.strftime(fmt))
df.select(
my_date_format(col("vacationdate"), lit("%d-%m-%Y")).alias("date_string")
).show()
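One caveat worth adding (my note, not part of the original answer): the lambda above will raise an error on NULL vacationdate values, since None has no strftime. A minimal null-safe variant might look like this:

```python
import datetime

# Null-safe formatting helper; returns None for missing dates
# instead of raising AttributeError inside the executor.
def format_date(d, fmt):
    return d.strftime(fmt) if d is not None else None

# Wrap it the same way as before:
# my_date_format = udf(format_date)
```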
Please note it uses C standard (strftime) format directives, not Java SimpleDateFormat patterns.
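To illustrate the difference in plain Python (outside Spark):

```python
import datetime

d = datetime.date(2015, 10, 7)
# C / strftime directives: %d = day, %m = month, %Y = 4-digit year
formatted = d.strftime("%d-%m-%Y")
print(formatted)  # 07-10-2015
# The equivalent Java SimpleDateFormat pattern would be "dd-MM-yyyy"
```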