How can I calculate exact median with Apache Spark?

Asked by 醉酒成梦 on 2020-12-06 05:30

This page lists some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median?

2 Answers
  • 2020-12-06 05:54

    You need to sort the RDD and take the element in the middle, or the average of the two middle elements. Here is an example with RDD[Int]:

      import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on pre-1.3 Spark)
      import org.apache.spark.rdd.RDD
    
      val rdd: RDD[Int] = ???
    
      // Sort the values, then key each value by its rank in the sorted order
      val sorted = rdd.sortBy(identity).zipWithIndex().map {
        case (v, idx) => (idx, v)
      }
    
      val count = sorted.count()
    
      // Even count: average the two middle elements; odd count: take the middle one
      val median: Double = if (count % 2 == 0) {
        val l = count / 2 - 1
        val r = l + 1
        (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
      } else sorted.lookup(count / 2).head.toDouble
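
    The even/odd selection logic above can be checked without a cluster. This is a minimal local (non-Spark) sketch of the same rule, using a plain Seq in place of the RDD; the `median` helper is an illustration, not part of the Spark API:

    ```scala
    // Local sketch of the same logic: sort, then pick the middle element
    // (odd count) or average the two middle elements (even count).
    def median(xs: Seq[Int]): Double = {
      val sorted = xs.sorted
      val n = sorted.length
      if (n % 2 == 0) (sorted(n / 2 - 1) + sorted(n / 2)).toDouble / 2
      else sorted(n / 2).toDouble
    }

    println(median(Seq(3, 1, 4, 1, 5)))     // 3.0
    println(median(Seq(3, 1, 4, 1, 5, 9)))  // 3.5
    ```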
    
  • 2020-12-06 06:13

    Using Spark 2.0+ and the DataFrame API, you can use the approxQuantile method:

    def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double)
    

    Since Spark 2.2 it also works on multiple columns at the same time. By setting probabilities to Array(0.5) and relativeError to 0, it will compute the exact median. From the documentation:

    The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive.

    Despite this, there seem to be some issues with precision when setting relativeError to 0; see the question here. A low error close to 0 will in some instances work better (this will depend on the Spark version).


    A small working example that calculates the median of the numbers from 1 to 99 (both inclusive), using a low relativeError:

    import spark.implicits._  // enables .toDF on a local collection

    val df = (1 to 99).toDF("num")
    val median = df.stat.approxQuantile("num", Array(0.5), 0.001)(0)
    println(median)
    

    The median returned is 50.0.
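
    Why a relativeError of 0.001 is still exact here can be sanity-checked from the quoted rank guarantee (my reading of the docs, not stated in the answer): the returned value's rank may deviate from the target rank by at most relativeError * n rows, so a bound under one row leaves no room for deviation. The `maxRankDeviation` helper below is purely illustrative:

    ```scala
    // Illustrative helper: the maximum rank deviation approxQuantile allows
    // for n rows at a given relativeError (rank guarantee: relativeError * n).
    def maxRankDeviation(n: Long, relativeError: Double): Double =
      relativeError * n

    // For 99 rows and relativeError = 0.001 the bound is below one row,
    // so the returned quantile must be the exact median.
    println(maxRankDeviation(99L, 0.001))
    ```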
