Question
I want to fill in the missing months with zero sales and calculate the 3-month average sales in PySpark.
My input:
product  specialty  date       sales
A        pharma     1/3/2019   50
A        pharma     1/4/2019   60
A        pharma     1/5/2019   70
A        pharma     1/8/2019   80
A        ENT        1/8/2019   50
A        ENT        1/9/2019   65
A        ENT        1/11/2019  40
My desired output:
product  specialty  date       sales  3month_avg_sales
A        pharma     1/3/2019   50     16.67
A        pharma     1/4/2019   60     36.67
A        pharma     1/5/2019   70     60
A        pharma     1/6/2019   0      43.33
A        pharma     1/7/2019   0      23.33
A        pharma     1/8/2019   80     26.67
A        ENT        1/8/2019   50     16.67
A        ENT        1/9/2019   65     38.33
A        ENT        1/10/2019  0      38.33
A        ENT        1/11/2019  40     35
from pyspark.sql import Row, Window, functions as F
from pyspark.sql.functions import expr, year, month

row = Row("Product", "specialty", "Date", "Sales")
df = sc.parallelize([row("A", "pharma", "1/3/2019", 50), row("A", "pharma", "1/4/2019", 60),
                     row("A", "pharma", "1/5/2019", 70), row("A", "pharma", "1/8/2019", 80),
                     row("A", "ENT", "1/8/2019", 50), row("A", "ENT", "1/9/2019", 65),
                     row("A", "ENT", "1/11/2019", 40)]).toDF()
w = Window.partitionBy("product", "specialty").orderBy("date")
df = df.withColumn("new_data_date", expr("add_months(Date, 1)"))
# stuck here: trying to set sales = 0 for the missing months before computing the average
df = df.withColumn("index", (year("Date") - 2020) * 12 + month("Date")).withColumn("avg", F.sum("Sales").over(w) / 3)
I am stuck on inserting rows with sales = 0 wherever a month is missing, and then calculating the 3-month average.
Answer 1:
You can use the Spark SQL built-in functions transform + sequence to create the missing months and set their sales to 0, and Window aggregate functions to calculate the required end_date and the final 3-month average sales. Below the code is divided into three steps for illustration; you can merge them based on your own requirements.
Note: this assumes at most one record per distinct month and that all date values have day=1; otherwise truncate the dates to the month level, e.g. with F.trunc(F.to_date('date', 'd/M/yyyy'), "month"), and/or define the logic for handling duplicate entries.
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([
('A', 'pharma', '1/3/2019', 50), ('A', 'pharma', '1/4/2019', 60),
('A', 'pharma', '1/5/2019', 70), ('A', 'pharma', '1/8/2019', 80),
('A', 'ENT', '1/8/2019', 50), ('A', 'ENT', '1/9/2019', 65),
('A', 'ENT', '1/11/2019', 40)
], ['product', 'specialty', 'date', 'sales'])
df = df.withColumn('date', F.to_date('date', 'd/M/yyyy'))
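If the input did not already satisfy that assumption (dates on day 1, at most one record per month), a normalization pass like the following sketch could be run right after parsing the dates (assumption: duplicate sales within a month should simply be summed):
from pyspark.sql import functions as F

# Sketch: snap each date to the first day of its month and
# sum up any duplicate records that land in the same month (assumption).
df = (df
      .withColumn('date', F.trunc('date', 'month'))
      .groupBy('product', 'specialty', 'date')
      .agg(F.sum('sales').alias('sales')))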
Step-1: set up WinSpec w1 and use the Window aggregate function lead to find the next date over(w1); shifting it back one month gives the end_date for each row's date sequence:
w1 = Window.partitionBy('product', 'specialty').orderBy('date')
df1 = df.withColumn('end_date', F.coalesce(F.add_months(F.lead('date').over(w1),-1),'date'))
+-------+---------+----------+-----+----------+
|product|specialty| date|sales| end_date|
+-------+---------+----------+-----+----------+
| A| ENT|2019-08-01| 50|2019-08-01|
| A| ENT|2019-09-01| 65|2019-10-01|
| A| ENT|2019-11-01| 40|2019-11-01|
| A| pharma|2019-03-01| 50|2019-03-01|
| A| pharma|2019-04-01| 60|2019-04-01|
| A| pharma|2019-05-01| 70|2019-07-01|
| A| pharma|2019-08-01| 80|2019-08-01|
+-------+---------+----------+-----+----------+
Step-2: use months_between(end_date, date) to calculate the number of months between the two dates, use the transform function to iterate through sequence(0, #months) and create a named_struct with date=add_months(date,i) and sales=IF(i=0,sales,0), then use inline_outer to explode the resulting array of structs:
df2 = df1.selectExpr("product", "specialty", """
inline_outer(
transform(
sequence(0,int(months_between(end_date, date))),
i -> (add_months(date,i) as date, IF(i=0,sales,0) as sales)
)
)
""")
+-------+---------+----------+-----+
|product|specialty| date|sales|
+-------+---------+----------+-----+
| A| ENT|2019-08-01| 50|
| A| ENT|2019-09-01| 65|
| A| ENT|2019-10-01| 0|
| A| ENT|2019-11-01| 40|
| A| pharma|2019-03-01| 50|
| A| pharma|2019-04-01| 60|
| A| pharma|2019-05-01| 70|
| A| pharma|2019-06-01| 0|
| A| pharma|2019-07-01| 0|
| A| pharma|2019-08-01| 80|
+-------+---------+----------+-----+
Step-3: use the following WinSpec w2 and the sum aggregate function (divided by N) to calculate the rolling 3-month average:
N = 3
w2 = Window.partitionBy('product', 'specialty').orderBy('date').rowsBetween(-N+1,0)
df_new = df2.select("*", F.round(F.sum('sales').over(w2)/N,2).alias(f'{N}month_avg_sales'))
+-------+---------+----------+-----+----------------+
|product|specialty| date|sales|3month_avg_sales|
+-------+---------+----------+-----+----------------+
| A| ENT|2019-08-01| 50| 16.67|
| A| ENT|2019-09-01| 65| 38.33|
| A| ENT|2019-10-01| 0| 38.33|
| A| ENT|2019-11-01| 40| 35.0|
| A| pharma|2019-03-01| 50| 16.67|
| A| pharma|2019-04-01| 60| 36.67|
| A| pharma|2019-05-01| 70| 60.0|
| A| pharma|2019-06-01| 0| 43.33|
| A| pharma|2019-07-01| 0| 23.33|
| A| pharma|2019-08-01| 80| 26.67|
+-------+---------+----------+-----+----------------+
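Dividing the windowed sum by N reproduces the expected output above: the first rows of each group are still divided by 3 even though fewer months are available. If an average over only the available months were preferred instead, a small variant of Step-3 (a sketch, reusing the same w2) would be:
# Alternative (assumption): average over only the months present in the window,
# which differs from the expected output on the first N-1 rows of each group.
df_alt = df2.select("*", F.round(F.avg('sales').over(w2), 2).alias(f'{N}month_avg_sales'))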
Answer 2:
For the missing values you can just do
df.fillna(0, subset=['sales'])
For the 3-month average, you can find a good answer here; just be careful to parse the timestamp correctly and to change the window's starting day to -90.
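Note that fillna only replaces nulls in rows that already exist, so the missing months still need to be created first. A rough sketch of one way to do that (assuming date has already been parsed with f.to_date('date', 'd/M/yyyy') and that each group should be filled between its own first and last month): build a month spine per (product, specialty) and left-join before the fillna.
from pyspark.sql import functions as f

# Sketch: one row per month between each group's first and last month,
# left-joined to the original data, with the sales gaps filled as 0.
months = (df
          .groupBy('product', 'specialty')
          .agg(f.min('date').alias('start'), f.max('date').alias('end'))
          .select('product', 'specialty',
                  f.explode(f.expr("sequence(start, end, interval 1 month)")).alias('date')))
filled = (months
          .join(df, ['product', 'specialty', 'date'], 'left')
          .fillna(0, subset=['sales']))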
UPDATE
This code should do the job you are looking for:
from pyspark.sql import functions as f, Window

days = lambda i: i * 86400  # seconds in i days
w = Window.orderBy(f.col("timestampGMT").cast('long')).rangeBetween(-days(90), 0)
missings_df = sparkSession.createDataFrame([('A', 'pharma', '1/6/2019', 0)],
                                           ['product', 'specialty', 'date', 'sales'])
df = (df
      .union(missings_df)  # adding the missing row
      .withColumn('timestampGMT', f.to_date('date', 'd/M/yyyy').cast('timestamp'))  # cast to timestamp
      .withColumn('rolling_average', f.avg("sales").over(w))  # rolling average over the last 90 days
      )
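Note that rangeBetween(-days(90), 0) only approximates three calendar months, and the window above is not partitioned by group. If exact calendar months per product/specialty were needed, a month-index based window could be used instead (a sketch, assuming the timestampGMT column created above and that per-group averages are wanted):
from pyspark.sql import functions as f, Window

# Sketch: window over the current and the two preceding calendar months,
# partitioned per product/specialty (assumption).
df = df.withColumn('month_idx', f.year('timestampGMT') * 12 + f.month('timestampGMT'))
w_months = (Window.partitionBy('product', 'specialty')
                  .orderBy('month_idx')
                  .rangeBetween(-2, 0))
df = df.withColumn('rolling_average', f.avg('sales').over(w_months))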
Source: https://stackoverflow.com/questions/64486068/filling-missing-sales-value-with-zero-and-calculate-3-month-average-in-pyspark