Find median in Spark SQL for multiple double datatype columns


I need to find the median of multiple columns of double datatype. Please suggest the correct approach.

Below is my sample dataset with one column.

1 answer

面向向阳花

2021-01-13 00:23

Out of curiosity, which version of Apache Spark are you using? Apache Spark 2.0+ included several fixes, among them changes to approxQuantile.

If I run the PySpark snippet below:

rdd = sc.parallelize([[1, 0.0], [1, 0.0], [1, 1.0], [1, 1.0], [1, 1.0], [1, 1.0]])
df = rdd.toDF(['id', 'num'])
df.createOrReplaceTempView("df")

with the median calculated using approxQuantile (and, in SQL, percentile_approx) as:

df.approxQuantile("num", [0.5], 0.25)

spark.sql("select percentile_approx(num, 0.5) from df").show()

the results are:

Spark 2.0.0: 0.25
Spark 2.0.1: 1.0
Spark 2.1.0: 1.0
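
To see which answers the requested tolerance actually permits, here is a minimal sketch in plain Python (not Spark). It assumes the guarantee described in the approxQuantile documentation, namely that the returned value is a sample from the column whose rank lies between floor((p - err) * N) and ceil((p + err) * N). The helper name allowed_values and the 1-based rank convention are assumptions made for illustration.

```python
import math

def allowed_values(values, p, err):
    """Distinct values whose 1-based rank in sorted order satisfies
    floor((p - err) * N) <= rank <= ceil((p + err) * N)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = max(1, math.floor((p - err) * n))   # lowest acceptable rank
    hi = min(n, math.ceil((p + err) * n))    # highest acceptable rank
    return sorted(set(ordered[lo - 1:hi]))

nums = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(allowed_values(nums, 0.5, 0.25))  # tolerance used in the example above
print(allowed_values(nums, 0.5, 0.0))   # zero tolerance, i.e. the exact rank
```

Under these assumptions both 0.0 and 1.0 are acceptable at a tolerance of 0.25, and only 1.0 at a tolerance of 0.0. The value 0.25 is not in the column at all, which fits the point about fixes after 2.0.0.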

Note that these are approximate values (via approxQuantile), though in general this should work well. If you need the exact median, one approach is to use numpy.median. The snippet below adapts gench's Stack Overflow answer to "How to find the median in Apache Spark with Python Dataframe API?" for this df example:

from pyspark.sql.types import FloatType
import pyspark.sql.functions as F
import numpy as np

def find_median(values):
    try:
        median = np.median(values)  # median of the list of values held in one row
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values

# wrap the Python function as a Spark UDF that returns a float
median_finder = F.udf(find_median, FloatType())

# collect all "num" values per id into a list, then apply the UDF to that list
df2 = df.groupBy("id").agg(F.collect_list("num").alias("nums"))
df2 = df2.withColumn("median", median_finder("nums"))

# print out
df2.show()

with the output of:

+---+--------------------+------+
| id|                nums|median|
+---+--------------------+------+
|  1|[0.0, 0.0, 1.0, 1...|   1.0|
+---+--------------------+------+
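
The question asks about several double columns, while the examples here use the single column num. One possible way to cover several columns is to generate one percentile_approx expression per column and run them in a single query. This gives approximate medians, unlike the UDF above. The sketch below only builds the SQL string in plain Python; the helper name median_query and the column names col_a and col_b are placeholders, not taken from the question.

```python
def median_query(table, columns, percentile=0.5):
    """Build one SELECT that applies percentile_approx to each listed column."""
    exprs = ", ".join(
        "percentile_approx(`{c}`, {p}) AS `{c}_median`".format(c=c, p=percentile)
        for c in columns
    )
    return "SELECT {} FROM {}".format(exprs, table)

# col_a and col_b are hypothetical column names used only for illustration.
query = median_query("df", ["col_a", "col_b"])
print(query)
```

With a live SparkSession and a registered temp view named df, the resulting string could be passed to spark.sql(query), in the same way as the single-column query earlier.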

Updated: Spark 1.6 Scala version using RDDs

If you are using Spark 1.6, you can calculate the median in Scala, following Eugene Zhulenev's answer to "How can I calculate the exact median with Apache Spark?". Below is the modified code that works with our example.

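import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD  // needed for the RDD[Double] type annotation if not already in scope

val rdd: RDD[Double] = sc.parallelize(Seq(0.0, 0.0, 1.0, 1.0, 1.0, 1.0))

// sort the values and key each one by its position in the sorted order
val sorted = rdd.sortBy(identity).zipWithIndex().map {
  case (v, idx) => (idx, v)
}

val count = sorted.count()

// even count: average the two middle values; odd count: take the middle value
val median: Double = if (count % 2 == 0) {
  val l = count / 2 - 1
  val r = l + 1
  (sorted.lookup(l).head + sorted.lookup(r).head) / 2
} else sorted.lookup(count / 2).head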

with the output of:

// output
import org.apache.spark.SparkContext._
rdd: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[227] at parallelize at <console>:34
sorted: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[234] at map at <console>:36
count: Long = 6
median: Double = 1.0

Note that this calculates the exact median using RDDs, so you will need to convert the DataFrame column into an RDD to perform this calculation.
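
The even/odd index arithmetic in the Scala code can be checked locally without Spark. The following is a minimal plain-Python sketch of the same logic on an in-memory list; the function name exact_median is made up for illustration, and the sketch does not reproduce the distributed sortBy, zipWithIndex and lookup steps.

```python
def exact_median(values):
    """Sort, then average the two middle elements (even count)
    or take the single middle element (odd count)."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        raise ValueError("median of an empty sequence is undefined")
    if count % 2 == 0:
        left = count // 2 - 1   # index of the lower middle element
        right = left + 1        # index of the upper middle element
        return (ordered[left] + ordered[right]) / 2
    return ordered[count // 2]

print(exact_median([0.0, 0.0, 1.0, 1.0, 1.0, 1.0]))
```

For the six sample values the count is even, so the elements at indices 2 and 3 (both 1.0) are averaged, matching the median of 1.0 reported above.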
