Question
This is a follow-up to my earlier PySpark SQL question, "Add different Qtr start_date, End_date for exploded rows". Thanks.
I have the following DataFrame, which has an array column (cf_values).
+---------------+------------+----------+----------+---+---------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+---------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|2020-10-01|2020-12-31|
+---------------+------------+----------+----------+---+---------+----------+----------+
I need a new column that holds a single cf_values element for each row, added to the existing records with withColumn. If I use explode, I get duplicate records and end up with 16 rows.
+---------------+------------+----------+----------+---+---------+------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|cf_new|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+---------+------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-01-01|2019-12-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-04-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-07-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|4     |2020-10-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4,4,3]|3     |2020-10-01|2020-12-30|
+---------------+------------+----------+----------+---+---------+------+----------+----------+
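The row count follows from how explode works: it emits one output row per array element for every input row, so 4 quarter rows with a 4-element array give 4 x 4 = 16 rows. A minimal pure-Python sketch of that behavior is below. Plain dicts stand in for DataFrame rows, and explode_rows is a hypothetical helper written for this illustration, not a Spark function.

```python
# Plain dicts stand in for DataFrame rows; this models explode, it is not Spark code.
quarters = [
    ("2020-01-01", "2020-03-31"),
    ("2020-04-01", "2020-06-30"),
    ("2020-07-01", "2020-09-30"),
    ("2020-10-01", "2020-12-31"),
]
rows = [
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3],
     "new_sdt": start, "new_edate": end}
    for start, end in quarters
]


def explode_rows(rows, array_col, out_col):
    """Hypothetical helper: emit one output row per array element of every input row."""
    return [{**row, out_col: value} for row in rows for value in row[array_col]]


exploded = explode_rows(rows, "cf_values", "cf_new")
print(len(rows), len(exploded))  # 4 16
```

Each quarter row is repeated once per element of cf_values, which is why every new_sdt appears four times in the exploded result.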
Expected result: 4 records, each with a single cf_values element (cf_new) and its own new start and end dates (new_sdt, new_edate).
+---------------+------------+----------+----------+---+------+----------+----------+
|customer_number|sales_target|start_date|end_date  |noq|cf_new|new_sdt   |new_edate |
+---------------+------------+----------+----------+---+------+----------+----------+
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-01-01|2020-03-31|
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-04-01|2020-06-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |4     |2020-07-01|2020-09-30|
|A011021        |15          |2020-01-01|2020-12-31|4  |3     |2020-10-01|2020-12-31|
+---------------+------------+----------+----------+---+------+----------+----------+
Answer 1:
Instead of exploding the array, you can pick each value from the array based on its position.
The position can be generated dynamically with row_number over a window partitioned by customer_number and ordered by new_sdt, as shown below.
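from pyspark.sql import Window
from pyspark.sql.functions import expr, row_number

# Number the rows of each customer in quarter order: 1, 2, 3, 4.
window = Window.partitionBy('customer_number').orderBy('new_sdt')

# row_number() is 1-based and array indexing is 0-based, hence row_num - 1.
(df.withColumn('row_num', row_number().over(window))
   .withColumn('cf_new', expr('cf_values[row_num - 1]'))
   .drop('row_num')
   .show())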
Output:
+---------------+------------+----------+----------+---+------------+----------+----------+------+
|customer_number|sales_target|start_date| end_date|noq| cf_values| new_sdt| new_edate|cf_new|
+---------------+------------+----------+----------+---+------------+----------+----------+------+
| A011021| 15|2020-01-01|2020-12-31| 4|[4, 4, 4, 3]|2020-01-01|2020-03-31| 4|
| A011021| 15|2020-01-01|2020-12-31| 4|[4, 4, 4, 3]|2020-04-01|2020-06-30| 4|
| A011021| 15|2020-01-01|2020-12-31| 4|[4, 4, 4, 3]|2020-07-01|2020-09-30| 4|
| A011021| 15|2020-01-01|2020-12-31| 4|[4, 4, 4, 3]|2020-10-01|2020-12-31| 3|
+---------------+------------+----------+----------+---+------------+----------+----------+------+
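To make the positional pick easier to follow without a Spark session, here is a rough pure-Python model of the same idea: sort the rows inside each customer_number partition by new_sdt, number them, and take the array element at that position. The helper name pick_by_position and the choice to return None for an out-of-range position are assumptions made for this sketch and are not part of the answer.

```python
from itertools import groupby
from operator import itemgetter


def pick_by_position(rows, partition_col, order_col, array_col, out_col):
    """Hypothetical helper: number rows per partition, then index the array."""
    result = []
    ordered = sorted(rows, key=itemgetter(partition_col, order_col))
    for _, group in groupby(ordered, key=itemgetter(partition_col)):
        # enumerate is 0-based, so it plays the role of row_number() - 1.
        for position, row in enumerate(group):
            values = row[array_col]
            # Returning None when the position is out of range is a sketch choice.
            picked = values[position] if position < len(values) else None
            result.append({**row, out_col: picked})
    return result


# Rows deliberately listed out of order to show that the ordering column matters.
rows = [
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3],
     "new_sdt": "2020-10-01", "new_edate": "2020-12-31"},
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3],
     "new_sdt": "2020-01-01", "new_edate": "2020-03-31"},
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3],
     "new_sdt": "2020-07-01", "new_edate": "2020-09-30"},
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3],
     "new_sdt": "2020-04-01", "new_edate": "2020-06-30"},
]

picked = pick_by_position(rows, "customer_number", "new_sdt", "cf_values", "cf_new")
for row in picked:
    print(row["new_sdt"], row["new_edate"], row["cf_new"])
```

The row count stays at 4 because no row is duplicated; each row only gains one column. If a customer could have more than one target period, the partition key would presumably need start_date and end_date as well, but that is an assumption about the data rather than something the question states.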
Source: https://stackoverflow.com/questions/62356574/how-to-explode-an-array-without-duplicate-records