Question
For example, I have a dataset like this:
from pyspark.sql.functions import col, count, sum  # note: this sum is Spark's, shadowing the built-in

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()
and I can obtain the customer-by-region order count matrix with:
overall_stat = test.groupBy("customerid").agg(count("orderid"))\
    .withColumnRenamed("count(orderid)", "overall_count")
temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid"))\
    .na.fill(0).join(overall_stat, ["customerid"])

# turn each region's raw count into a fraction of the customer's overall count
for field in temp_result.schema.fields:
    if str(field.name) not in ["customerid", "overall_count", "overall_amount"]:
        name = str(field.name)
        temp_result = temp_result.withColumn(name, col(name) / col("overall_count"))
temp_result.show()
The data would look like this:
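(Computed from the sample input above; row order may vary.)
+----------+------------------+------------------+-------------+
|customerid|          Region A|          Region B|overall_count|
+----------+------------------+------------------+-------------+
|         1|0.6666666666666666|0.3333333333333333|            3|
|         2|               0.0|               1.0|            1|
|         3|               1.0|               0.0|            1|
+----------+------------------+------------------+-------------+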
Now, I want to calculate the weighted average by the overall_count. How can I do it?
The result should be (0.66*3+1*1)/4 for region A, and (0.33*3+1*1)/4 for region B.
My thoughts:
It can certainly be achieved by converting the data to Python/pandas and doing the calculation there, but in that case, when should we use PySpark at all?
I can get something like
temp_result.agg(sum(col("Region A") * col("overall_count")), sum(col("Region B")*col("overall_count"))).show()
but it doesn't feel right, especially if there are many regions to count.
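One way to avoid hard-coding each region is to build the aggregation expressions from the schema. A minimal sketch, assuming a standard weighted average where the denominator is the total overall_count:
from pyspark.sql.functions import col, sum as sf_sum

# every column except the key and the weight is a region column
region_cols = [f.name for f in temp_result.schema.fields
               if f.name not in ("customerid", "overall_count")]

# weighted average per region: sum(value * weight) / sum(weight)
temp_result.agg(*[
    (sf_sum(col(c) * col("overall_count")) / sf_sum("overall_count")).alias(c)
    for c in region_cols
]).show()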
Answer 1:
You can achieve a weighted average by breaking your above steps into multiple stages.
Consider the following:
Dataframe Name: sales_table
[ total_sales, count_of_orders, location]
[ 50 , 9 , A ]
[ 80 , 4 , A ]
[ 90 , 7 , A ]
Calculating the grouped weighted average of the above (70) breaks into three steps:
- Multiplying sales by importance (here, total_sales by count_of_orders)
- Aggregating the sales_x_count product per group
- Dividing sales_x_count by the sum of the original weights (count_of_orders)
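For the sample table, that is (50*9 + 80*4 + 90*7) / (9 + 4 + 7) = 1400 / 20 = 70.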
If we break the above into several stages within our PySpark code, you can get the following:
from pyspark.sql import functions as sf
from pyspark.sql.functions import col

new_sales = sales_table \
    .withColumn("sales_x_count", col("total_sales") * col("count_of_orders")) \
    .groupBy("location") \
    .agg(sf.sum("count_of_orders").alias("sum_count_of_orders"),
         sf.sum("sales_x_count").alias("sum_sales_x_count")) \
    .withColumn("count_weighted_average",
                col("sum_sales_x_count") / col("sum_count_of_orders"))
So... no fancy UDF is really necessary here (and one would likely slow you down).
Source: https://stackoverflow.com/questions/52240650/pyspark-weighted-average-by-a-column