Question
The PySpark API provides many aggregate functions, but not the median. Spark 2 comes with approxQuantile, which gives approximate quantiles, but computing the exact median is very expensive. Is there a more PySpark-idiomatic way of calculating the median for a column of values in a Spark DataFrame?
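For reference, the approxQuantile method mentioned above is called as follows; a minimal sketch, assuming a DataFrame named df with a numeric column "salary" (both names are illustrative, not from the question):
# approxQuantile(col, probabilities, relativeError) is available from Spark 2.0.
# probabilities=[0.5] requests the median; relativeError bounds the approximation.
# Setting relativeError to 0.0 computes the exact quantile, which is the
# expensive case the question refers to.
approx_median = df.approxQuantile("salary", [0.5], 0.01)[0]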
Answer 1:
Here is an example implementation with the DataFrame API in Python (Spark 1.6+).
import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import FloatType
Let's assume we have monthly salaries for customers in a "salaries" Spark DataFrame with the columns:
month | customer_id | salary
and we would like to find the median salary per customer across all months.
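For concreteness, a toy version of this DataFrame can be built as follows; the values are made up purely for illustration, and an existing SparkSession named spark is assumed:
# hypothetical sample data matching the month | customer_id | salary layout
data = [
    ("2016-01", 1, 3000.0), ("2016-02", 1, 3200.0), ("2016-03", 1, 3100.0),
    ("2016-01", 2, 4500.0), ("2016-02", 2, 4700.0),
]
salaries = spark.createDataFrame(data, ["month", "customer_id", "salary"])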
Step 1: Write a user-defined function to calculate the median

def find_median(values_list):
    try:
        median = np.median(values_list)  # get the median of the values collected for each row
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values

median_finder = F.udf(find_median, FloatType())
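As a quick sanity check, the plain Python function can be exercised outside Spark (using the toy values from above):
find_median([3000.0, 3200.0, 3100.0])  # -> 3100.0
find_median([4500.0, 4700.0])          # -> 4600.0 (mean of the two middle values)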
Step 2: Aggregate on the salary column by collecting the salaries into a list per customer:
salaries_list = salaries.groupBy("customer_id").agg(F.collect_list("salary").alias("salaries"))
Step 3: Call the median_finder UDF on the salaries column to add the median values as a new column:
salaries_list = salaries_list.withColumn("median", median_finder("salaries"))
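With the toy data above, the result would look roughly like this (output paraphrased, not verbatim show() output):
salaries_list.show(truncate=False)
# customer_id=1 -> salaries=[3000.0, 3200.0, 3100.0], median=3100.0
# customer_id=2 -> salaries=[4500.0, 4700.0],         median=4600.0
Note that collect_list pulls every value for a group onto a single executor, so this approach assumes the per-customer lists fit comfortably in memory.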
Source: https://stackoverflow.com/questions/38743476/how-to-find-the-median-in-apache-spark-with-python-dataframe-api