I have two dataframes that look like the following. First, df1:
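from pyspark.sql.types import StructType, StructField, StringType
TEST_schema = StructType([StructField("description", StringType(), True),
                          StructField("date", StringType(), True)])
# "date" is declared as StringType, so the values are passed as strings
TEST_data = [('START','20200622'),('END','20201018')]
rdd3 = sc.parallelize(TEST_data)
df1 = sqlContext.createDataFrame(TEST_data, TEST_schema)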
+-----------+--------+
|description|    date|
+-----------+--------+
|      START|20200701|
|        END|20201003|
+-----------+--------+
And second, df2:
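from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_date
TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True)])
TEST_data = [('2020-08-01',1),('2020-08-02',-1),('2020-08-03',3),('2020-08-04',1),('2020-08-05',1),
             ('2020-08-06',2),('2020-08-07',4),('2020-08-08',5),('2020-08-09',-1)]
rdd3 = sc.parallelize(TEST_data)
df2 = sqlContext.createDataFrame(TEST_data, TEST_schema)
# parse the string column into a real date column
df2 = df2.withColumn("date", to_date("date", 'yyyy-MM-dd'))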
+----------+----+
|      date|col1|
+----------+----+
|2020-08-01|   1|
|2020-08-02|  -1|
|2020-08-03|   3|
|2020-08-04|   1|
|2020-08-05|   1|
|2020-08-06|   2|
|2020-08-07|   4|
|2020-08-08|   5|
|2020-08-09|  -1|
+----------+----+
First step: I want to select the start date, which is "20200701". This is how I did it:
start = df1.filter(df1['description'] == 'START')
start = start.withColumn('value', to_date( col('date'), 'yyyyMMdd') )
DATE =start.select('value').collect()[0]['value']
The output is 2020-07-01, which must be in date format.
Second step: Now I want to add DATE + df2['col1'] as a new column in df2, leaving it blank where col1 == -1.
My approach:
WANT= df2.withColumn('start', lit(DATE))
WANT= WANT.withColumn('want', when( col('col1') == -1, "").otherwise(date_add(col('start') ,col('col1')) ))
This did not work because the second parameter of the date_add function must be a value, not a column.
My expected result is the following:
+----------+----+----------+
|      date|col1|      want|
+----------+----+----------+
|2020-08-01|   1|2020-07-02|
|2020-08-02|  -1|          |
|2020-08-03|   3|2020-07-04|
|2020-08-04|   1|2020-07-02|
|2020-08-05|   1|2020-07-02|
|2020-08-06|   2|2020-07-03|
|2020-08-07|   4|2020-07-05|
|2020-08-08|   5|2020-07-06|
|2020-08-09|  -1|          |
+----------+----+----------+
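The arithmetic behind the expected column can be checked in plain Python, without Spark. This is only a sketch of the intended logic: the helper name `add_offset` is hypothetical, and `None` stands in for the blank cell.

```python
from datetime import date, timedelta
from typing import Optional

START = date(2020, 7, 1)  # the START date parsed from df1 in the first step


def add_offset(start: date, offset: int) -> Optional[date]:
    """Return start + offset days, or None when offset is the -1 sentinel."""
    if offset == -1:
        return None
    return start + timedelta(days=offset)


col1 = [1, -1, 3, 1, 1, 2, 4, 5, -1]  # the col1 values of df2
want = [add_offset(START, n) for n in col1]

for n, w in zip(col1, want):
    print(n, w)
```

Whatever Spark expression ends up being used should agree with this row by row.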
I also tried the following:
df2 = df2.withColumn('start', lit(DATE))
df2 = df2.select(
    '*',
    expr("IF(col1 == -1, NULL, date_add(start, col1))").alias('want'))
Output, which gave all null values:
+----------+----+--------+----+
|      date|col1|   start|want|
+----------+----+--------+----+
|2020-08-01|   1|20200622|null|
|2020-08-02|  -1|20200622|null|
|2020-08-03|   3|20200622|null|
|2020-08-04|   1|20200622|null|
|2020-08-05|   1|20200622|null|
|2020-08-06|   2|20200622|null|
|2020-08-07|   4|20200622|null|
|2020-08-08|   5|20200622|null|
|2020-08-09|  -1|20200622|null|
+----------+----+--------+----+
You can simply add the column to the dataframe with a case statement:
df_final = df2.withColumn("want", expr("case when col1 <> -1 then date_add(to_date('2020-07-01'), col1) end"))
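The line above hardcodes '2020-07-01'. If the start date instead comes from the earlier collect() as a Python datetime.date, one option is to format it into the expression string. A minimal sketch: the helper name `build_want_expr` and its parameters are made up for illustration, and the Spark call is shown only as a comment.

```python
from datetime import date


def build_want_expr(start: date, offset_col: str = "col1", skip_value: int = -1) -> str:
    """Build a Spark SQL CASE expression that adds offset_col days to start.

    Rows where offset_col equals skip_value match no branch and yield NULL.
    """
    return (
        f"case when {offset_col} <> {skip_value} "
        f"then date_add(to_date('{start.isoformat()}'), {offset_col}) end"
    )


DATE = date(2020, 7, 1)  # stands in for the value collected from df1
want_sql = build_want_expr(DATE)
print(want_sql)

# Assumed usage in PySpark (not run here):
#   from pyspark.sql.functions import expr
#   df_final = df2.withColumn("want", expr(want_sql))
```

Because the date is rendered with isoformat(), to_date receives it in the yyyy-MM-dd form it parses by default.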
This can also be done easily in Spark SQL:
spark.sql("""select df2.date, df2.col1,
             case when df2.col1 <> -1
             then date_add(to_date('2020-07-01'), df2.col1) end want
             from df2""").show(truncate=False)
This will work once df2 has been registered as a view for Spark SQL.
Either way you run it, here's the result:
+----------+----+----------+
|date      |col1|want      |
+----------+----+----------+
|2020-08-01|1   |2020-07-02|
|2020-08-02|-1  |null      |
|2020-08-03|3   |2020-07-04|
|2020-08-04|1   |2020-07-02|
|2020-08-05|1   |2020-07-02|
|2020-08-06|2   |2020-07-03|
|2020-08-07|4   |2020-07-05|
|2020-08-08|5   |2020-07-06|
|2020-08-09|-1  |null      |
+----------+----+----------+