Question
Here is my function that calculates root mean squared error. However, the last line does not compile, failing with a type mismatch error (expected: Double, actual: Unit). I have tried many different ways to solve this issue, but still without success. Any ideas?
def calculateRMSE(output: DStream[(Double, Double)]): Double = {
  val summse = output.foreachRDD { rdd =>
    rdd.map {
      case pair: (Double, Double) =>
        val err = math.abs(pair._1 - pair._2)
        err * err
    }.reduce(_ + _)
  }
  // math.sqrt(summse) HOW TO APPLY SQRT HERE?
}
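The root of the problem can be reproduced without Spark at all: like DStream.foreachRDD, the foreach on ordinary Scala collections returns Unit, while map returns a value that further computation can use. A minimal plain-Scala sketch (an analogy, not Spark code):

```scala
object ForeachVsMap extends App {
  val pairs = List((3.0, 1.0), (5.0, 2.0))

  // foreach is for side effects only: its result type is Unit
  val unitResult: Unit = pairs.foreach { case (a, b) => math.abs(a - b) }

  // map returns a transformed collection, so the result can feed math.sqrt
  val sumSquaredErr: Double = pairs.map { case (a, b) =>
    val err = math.abs(a - b)
    err * err
  }.sum

  println(math.sqrt(sumSquaredErr / pairs.length)) // RMSE over the list
}
```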
Answer 1:
As eliasah pointed out, foreach (and foreachRDD) don't return a value; they are for side effects only. If you want to return something, you need map. Based on your second solution:
val rmse = output.map(rdd => new RegressionMetrics(rdd).rootMeanSquaredError)
It looks better if you make a little function for it:
val getRmse = (rdd: RDD[(Double, Double)]) => new RegressionMetrics(rdd).rootMeanSquaredError
val rmse = output.map(getRmse)
Ignoring empty RDDs,
val rmse = output.filter(_.nonEmpty).map(getRmse)
Here is the exact same sequence as a for-comprehension. For-comprehensions are just syntactic sugar for map, flatMap, and withFilter, but I found them much easier to understand when I was first learning Scala:
val rmse = for {
  rdd <- output
  if rdd.nonEmpty
} yield new RegressionMetrics(rdd).rootMeanSquaredError
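The same desugaring can be demonstrated with an ordinary Scala collection (an illustration, not Spark code): the compiler rewrites a for-comprehension with a guard into withFilter followed by map.

```scala
object ForDesugar extends App {
  // Stand-in for a stream of batches; empty batches should be skipped
  val batches: List[List[Double]] = List(List(1.0, 2.0), Nil, List(3.0))

  // The for-comprehension with a guard...
  val viaFor = for {
    batch <- batches
    if batch.nonEmpty
  } yield batch.sum

  // ...is sugar for withFilter followed by map
  val viaCalls = batches.withFilter(_.nonEmpty).map(_.sum)

  println(viaFor == viaCalls) // the two forms produce the same result
}
```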
And here's a function summing the errors, like your first attempt:
def calculateRmse(output: DStream[(Double, Double)]): Double = {
  val getRmse = (rdd: RDD[(Double, Double)]) => new RegressionMetrics(rdd).rootMeanSquaredError
  output.filter(_.nonEmpty).map(getRmse).reduce(_ + _)
}
The compiler's complaint about nonEmpty is actually an issue with DStream's filter method. Instead of operating on the RDDs in the DStream, filter operates on the pairs of doubles (Double, Double) given by your DStream's type parameter.
I don't know enough about Spark to say it's a flaw, but it is very strange. filter and most other operations over collections are typically defined in terms of foreach, but DStream implements those functions without following the same convention; its deprecated foreach method and the current foreachRDD both operate over the stream's RDDs, but its other higher-order methods don't.
So my method won't work. DStream probably has a good reason for being weird (performance related?). Here's a probably-bad way to do it with foreachRDD:
def calculateRmse(ds: DStream[(Double, Double)]): Double = {
  var totalError: Double = 0
  def getRmse(rdd: RDD[(Double, Double)]): Double = new RegressionMetrics(rdd).rootMeanSquaredError
  ds.foreachRDD((rdd: RDD[(Double, Double)]) => if (!rdd.isEmpty) totalError += getRmse(rdd))
  totalError
}
But it works!
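The pattern above — a side-effecting foreach whose closure mutates a var defined outside it — can also be sketched with plain Scala collections (an analogy only; in Spark this works because foreachRDD runs its body on the driver):

```scala
object AccumulateViaForeach extends App {
  // Stand-in for a stream of batches of (prediction, label) pairs
  val batches = List(List((3.0, 1.0)), Nil, List((5.0, 2.0)))

  // Sum of squared errors for one batch
  def batchError(b: List[(Double, Double)]): Double =
    b.map { case (p, l) => (p - l) * (p - l) }.sum

  // Side-effecting accumulation: foreach returns Unit, so the result
  // must be collected through a mutable variable the closure captures
  var totalError = 0.0
  batches.foreach(b => if (b.nonEmpty) totalError += batchError(b))

  println(totalError) // 4.0 from the first batch + 9.0 from the last
}
```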
Answer 2:
I managed to do this task as follows:
import org.apache.spark.mllib.evaluation.RegressionMetrics
output.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val metrics = new RegressionMetrics(rdd)
    val rmse = metrics.rootMeanSquaredError
    println("RMSE: " + rmse)
  }
}
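For intuition about what rootMeanSquaredError produces here: RMSE is sqrt(mean((prediction - label)^2)). A plain-Scala equivalent over an in-memory batch (a hypothetical helper for illustration, not part of Spark's API):

```scala
object RmseSketch extends App {
  // Hypothetical helper mirroring the RMSE formula that
  // RegressionMetrics.rootMeanSquaredError computes for (prediction, label) pairs
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "RMSE is undefined for an empty batch")
    val meanSquaredError = pairs.map { case (p, l) =>
      val err = p - l
      err * err
    }.sum / pairs.length
    math.sqrt(meanSquaredError)
  }

  println(rmse(Seq((3.0, 1.0), (5.0, 2.0)))) // sqrt((4 + 9) / 2)
}
```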
Source: https://stackoverflow.com/questions/36984923/how-to-solve-type-mismatch-issue-expected-double-actual-unit