Pyspark: how to add Date + numeric value format


Question


I have two dataframes that look like the following. First, df1:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, expr, when, to_date, date_add

TEST_schema = StructType([StructField("description", StringType(), True),
                          StructField("date", StringType(), True)
                          ])
TEST_data = [('START', '20200701'), ('END', '20201003')]   # dates stored as yyyyMMdd strings
df1 = sqlContext.createDataFrame(TEST_data, TEST_schema)
df1.show()

+-----------+--------+
|description|    date|
+-----------+--------+
|      START|20200701| 
|        END|20201003| 
+-----------+--------+

And the second dataframe, df2:

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True),
                          ])
TEST_data = [('2020-08-01',1),('2020-08-02',-1),('2020-08-03',3),('2020-08-04',1),('2020-08-05',1),
             ('2020-08-06',2),('2020-08-07',4),('2020-08-08',5),('2020-08-09',-1)]
df2 = sqlContext.createDataFrame(TEST_data, TEST_schema)
df2 = df2.withColumn("date", to_date("date", 'yyyy-MM-dd'))
df2.show()

+----------+----+
|      date|col1|
+----------+----+
|2020-08-01|   1|
|2020-08-02|  -1|
|2020-08-03|   3|
|2020-08-04|   1|
|2020-08-05|   1|
|2020-08-06|   2|
|2020-08-07|   4|
|2020-08-08|   5|
|2020-08-09|  -1|
+----------+----+

First Step: I want to select the START date, which is "20200701". Here is how I did that:

start = df1.filter(df1['description'] == 'START')
start = start.withColumn('value', to_date(col('date'), 'yyyyMMdd'))
DATE = start.select('value').collect()[0]['value']
print(DATE)

The output is 2020-07-01, which is in date format as required.
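
As a side note, collecting a DateType value in PySpark returns a Python datetime.date, whose string form is the ISO date; this is handy if DATE is later interpolated into a SQL expression:

# DATE collected from a DateType column is a Python datetime.date
print(type(DATE))   # <class 'datetime.date'>
print(str(DATE))    # 2020-07-01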

Second Step: Now I want to add that DATE + df2['col1'] as a new column in df2, and leave it blank when col1 == -1.

How I approached it:

WANT = df2.withColumn('start', lit(DATE))

WANT = WANT.withColumn('want', when(col('col1') == -1, "").otherwise(date_add(col('start'), col('col1'))))

This did not work, because the second parameter of the date_add function must be a value, not a column.

My expected result is the following:

+----------+----+----------+
|      date|col1|      want|
+----------+----+----------+
|2020-08-01|   1|2020-07-02|
|2020-08-02|  -1|          |
|2020-08-03|   3|2020-07-04|
|2020-08-04|   1|2020-07-02|
|2020-08-05|   1|2020-07-02|
|2020-08-06|   2|2020-07-03|
|2020-08-07|   4|2020-07-05|
|2020-08-08|   5|2020-07-06|
|2020-08-09|  -1|          |
+----------+----+----------+

ATTEMPT 1:

df2 = df2.withColumn('start', lit(DATE))

df2 = df2.select(
    '*',
    expr("IF(col1 == -1, NULL, date_add(start, col1))").alias('want'))
df2.show()

Output, which gave all null values:

+----------+----+--------+----+
|      date|col1|   start|want|
+----------+----+--------+----+
|2020-08-01|   1|20200622|null|
|2020-08-02|  -1|20200622|null|
|2020-08-03|   3|20200622|null|
|2020-08-04|   1|20200622|null|
|2020-08-05|   1|20200622|null|
|2020-08-06|   2|20200622|null|
|2020-08-07|   4|20200622|null|
|2020-08-08|   5|20200622|null|
|2020-08-09|  -1|20200622|null|
+----------+----+--------+----+
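
Judging from this output, the start column holds the raw yyyyMMdd string rather than a parsed date, and date_add cannot implicitly cast such a string here, so every row comes back null. Parsing start into a real DateType first should make the same expression work. A minimal sketch, assuming start is stored as a yyyyMMdd string as shown above (df2_fixed is just an illustrative name):

from pyspark.sql.functions import expr, to_date

# Parse the yyyyMMdd string into a proper DateType column before calling date_add
df2_fixed = df2.withColumn('start', to_date('start', 'yyyyMMdd'))
df2_fixed = df2_fixed.select(
    '*',
    expr("IF(col1 == -1, NULL, date_add(start, col1))").alias('want'))
df2_fixed.show()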

Answer 1:


You can simply add the column to the dataframe with a case statement:

df_final = df2.withColumn("want", expr("case when col1 <> -1  then date_add(to_date('2020-07-01'), col1) end"))
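
For reference, a roughly equivalent formulation with the DataFrame API uses when without an otherwise clause, so rows with col1 = -1 stay null, mirroring the case expression with no else branch:

from pyspark.sql import functions as F

df_final = df2.withColumn(
    "want",
    F.when(F.col("col1") != -1,
           F.expr("date_add(to_date('2020-07-01'), col1)"))
)
df_final.show()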

This can also be done easily in Spark SQL:

spark.sql("""select df2.date, df2.col1, 
              case when df2.col1 <> -1  
               then date_add(to_date('2020-07-01'), df2.col1) end want
             from df2""").show(false)

This will work once you create a temp view for Spark SQL as follows:

df2.createOrReplaceTempView("df2")

Either way you run it, here's the result:

+----------+----+----------+
|date      |col1|want      |
+----------+----+----------+
|2020-08-01|1   |2020-07-02|
|2020-08-02|-1  |null      |
|2020-08-03|3   |2020-07-04|
|2020-08-04|1   |2020-07-02|
|2020-08-05|1   |2020-07-02|
|2020-08-06|2   |2020-07-03|
|2020-08-07|4   |2020-07-05|
|2020-08-08|5   |2020-07-06|
|2020-08-09|-1  |null      |
+----------+----+----------+
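
If the start date should come from df1 instead of being hard-coded, the DATE value collected in the first step can be interpolated into the same expression. A minimal sketch, assuming DATE is the datetime.date obtained earlier (its string form is '2020-07-01'):

from pyspark.sql import functions as F

# Interpolate the collected DATE (a datetime.date) into the case expression
df_final = df2.withColumn(
    "want",
    F.expr(f"case when col1 <> -1 then date_add(to_date('{DATE}'), col1) end")
)
df_final.show()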


Source: https://stackoverflow.com/questions/63348747/pyspark-how-to-add-date-numeric-value-format
