Question
Here is my function that calculates root mean squared error. However, the last line does not compile, failing with a type mismatch error (expected: Double, actual: Unit). I have tried many different ways to solve this issue, but still without success. Any ideas?
def calculateRMSE(output: DStream[(Double, Double)]): Double = {
  val summse = output.foreachRDD { rdd =>
    rdd.map {
      case pair: (Double, Double) =>
        val err = math.abs(pair._1 - pair._2)
        err * err
    }.reduce(_ + _)
  }
  // math.sqrt(summse) HOW TO APPLY SQRT HERE?
}
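The root of the problem can be reproduced without Spark at all: like DStream.foreachRDD, the foreach on ordinary Scala collections returns Unit, while map returns a value that further computation can use. A minimal plain-Scala sketch (an analogy, not Spark code):

```scala
object ForeachVsMap extends App {
  val pairs = List((3.0, 1.0), (5.0, 2.0))

  // foreach is for side effects only: its result type is Unit
  val unitResult: Unit = pairs.foreach { case (a, b) => math.abs(a - b) }

  // map returns a transformed collection, so the result can feed math.sqrt
  val sumSquaredErr: Double = pairs.map { case (a, b) =>
    val err = math.abs(a - b)
    err * err
  }.sum

  println(math.sqrt(sumSquaredErr / pairs.length)) // RMSE over the list
}
```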
Answer 1:
As eliasah pointed out, foreach (and foreachRDD) don't return a value; they are for side effects only. If you want to return something, you need map. Based on your second solution:
val rmse = output.map(rdd => new RegressionMetrics(rdd).rootMeanSquaredError)
It looks better if you make a little function for it:
val getRmse = (rdd: RDD[(Double, Double)]) => new RegressionMetrics(rdd).rootMeanSquaredError
val rmse = output.map(getRmse)
Ignoring empty RDDs,
val rmse = output.filter(_.nonEmpty).map(getRmse)
Here is the exact same sequence as a for-comprehension. For-comprehensions are just syntactic sugar for map, flatMap, and withFilter, but I found them much easier to understand when I was first learning Scala:
val rmse = for {
  rdd <- output
  if rdd.nonEmpty
} yield new RegressionMetrics(rdd).rootMeanSquaredError
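The same desugaring can be demonstrated with an ordinary Scala collection (an illustration, not Spark code): the compiler rewrites a for-comprehension with a guard into withFilter followed by map.

```scala
object ForDesugar extends App {
  // Stand-in for a stream of batches; empty batches should be skipped
  val batches: List[List[Double]] = List(List(1.0, 2.0), Nil, List(3.0))

  // The for-comprehension with a guard...
  val viaFor = for {
    batch <- batches
    if batch.nonEmpty
  } yield batch.sum

  // ...is sugar for withFilter followed by map
  val viaCalls = batches.withFilter(_.nonEmpty).map(_.sum)

  println(viaFor == viaCalls) // the two forms produce the same result
}
```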
And here's a function summing the errors, like your first attempt:
def calculateRmse(output: DStream[(Double, Double)]): Double = {
  val getRmse = (rdd: RDD[(Double, Double)]) => new RegressionMetrics(rdd).rootMeanSquaredError
  output.filter(_.nonEmpty).map(getRmse).reduce(_ + _)
}
The compiler's complaint about nonEmpty is actually an issue with DStream's filter method. Instead of operating on the RDDs in the DStream, filter operates on the pairs of doubles (Double, Double) given by your DStream's type parameter.
I don't know enough about Spark to say it's a flaw, but it is very strange. filter and most other operations over collections are typically defined in terms of foreach, but DStream implements those functions without following the same convention; its deprecated foreach method and the current foreachRDD both operate over the stream's RDDs, but its other higher-order methods don't.
So my method won't work. DStream probably has a good reason for being weird (performance related?). Here's a probably-bad way to do it with foreachRDD:
def calculateRmse(ds: DStream[(Double, Double)]): Double = {
  var totalError: Double = 0
  def getRmse(rdd: RDD[(Double, Double)]): Double = new RegressionMetrics(rdd).rootMeanSquaredError
  ds.foreachRDD((rdd: RDD[(Double, Double)]) => if (!rdd.isEmpty) totalError += getRmse(rdd))
  totalError
}
But it works!
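The pattern above — a side-effecting foreach whose closure mutates a var defined outside it — can also be sketched with plain Scala collections (an analogy only; in Spark this works because foreachRDD runs its body on the driver):

```scala
object AccumulateViaForeach extends App {
  // Stand-in for a stream of batches of (prediction, label) pairs
  val batches = List(List((3.0, 1.0)), Nil, List((5.0, 2.0)))

  // Sum of squared errors for one batch
  def batchError(b: List[(Double, Double)]): Double =
    b.map { case (p, l) => (p - l) * (p - l) }.sum

  // Side-effecting accumulation: foreach returns Unit, so the result
  // must be collected through a mutable variable the closure captures
  var totalError = 0.0
  batches.foreach(b => if (b.nonEmpty) totalError += batchError(b))

  println(totalError) // 4.0 from the first batch + 9.0 from the last
}
```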
Answer 2:
I managed to do this task as follows:
import org.apache.spark.mllib.evaluation.RegressionMetrics
output.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val metrics = new RegressionMetrics(rdd)
    val rmse = metrics.rootMeanSquaredError
    println("RMSE: " + rmse)
  }
}
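For intuition about what rootMeanSquaredError produces here: RMSE is sqrt(mean((prediction - label)^2)). A plain-Scala equivalent over an in-memory batch (a hypothetical helper for illustration, not part of Spark's API):

```scala
object RmseSketch extends App {
  // Hypothetical helper mirroring the RMSE formula that
  // RegressionMetrics.rootMeanSquaredError computes for (prediction, label) pairs
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "RMSE is undefined for an empty batch")
    val meanSquaredError = pairs.map { case (p, l) =>
      val err = p - l
      err * err
    }.sum / pairs.length
    math.sqrt(meanSquaredError)
  }

  println(rmse(Seq((3.0, 1.0), (5.0, 2.0)))) // sqrt((4 + 9) / 2)
}
```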
Source: https://stackoverflow.com/questions/36984923/how-to-solve-type-mismatch-issue-expected-double-actual-unit