Spark MLlib linear regression (linear least squares) giving random results

Anonymous (unverified), submitted 2019-12-03 01:23:02

Question:

I'm new to Spark and to machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working.

I found the sample code here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

(section LinearRegressionWithSGD)

Here is the code:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Building the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
    println("training Mean Squared Error = " + MSE)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = LinearRegressionModel.load(sc, "myModelPath")

(That's exactly what is on the website.)

The result is

training Mean Squared Error = 6.2087803138063045

and

valuesAndPreds.collect 

gives

    Array[(Double, Double)] = Array((-0.4307829,-1.8383286021929077),
    (-0.1625189,-1.4955700806407322), (-0.1625189,-1.118820892849544),
    (-0.1625189,-1.6134108278724875), (0.3715636,-0.45171266551058276),
    (0.7654678,-1.861316066986158), (0.8544153,-0.3588282725617985),
    (1.2669476,-0.5036812148225209), (1.2669476,-1.1534698170911792),
    (1.2669476,-0.3561392231695041), (1.3480731,-0.7347031705813306),
    (1.446919,-0.08564658011814863), (1.4701758,-0.656725375080344),
    (1.4929041,-0.14020483324910105), (1.5581446,-1.9438858658143454),
    (1.5993876,-0.02181165554398845), (1.6389967,-0.3778677315868635),
    (1.6956156,-1.1710092824030043), (1.7137979,0.27583044213064634),
    (1.8000583,0.7812664902440078), (1.8484548,0.94605507153074),
    (1.8946169,-0.7217282082851512), (1.9242487,-0.24422843221437684),...

My problem is that the predictions look totally random (and wrong). Since this is an exact copy of the website example, with the same input data (training set), I don't know where to look. Am I missing something?

Please give me some advice or a clue about where to look; I'm happy to read and experiment.

Thanks

Answer 1:

LinearRegressionWithSGD is trained with stochastic gradient descent (SGD), so the step size needs tuning; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details.

In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Build the model
    val regression = new LinearRegressionWithSGD().setIntercept(true)
    regression.optimizer.setStepSize(0.1)
    val model = regression.run(parsedData)

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
    println("training Mean Squared Error = " + MSE)

For another example on a more realistic dataset, see

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala
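To illustrate why the step size matters for a gradient-based fit, here is a minimal sketch in plain Python (no Spark). It is not MLlib's optimizer: it assumes full-batch gradient descent with a fixed step size on made-up data, and the names `fit_gd` and `mse` are hypothetical helpers written for this example.

```python
def fit_gd(xs, ys, step_size, num_iterations=100):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iterations):
        # Residuals of the current model on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, made up for illustration: y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

for step_size in (0.001, 0.1, 1.0):
    w, b = fit_gd(xs, ys, step_size)
    print(f"step={step_size}: w={w:.4g}, b={b:.4g}, MSE={mse(xs, ys, w, b):.4g}")
```

On this toy data, a step of 0.001 has not converged after 100 iterations, 0.1 gets close to the true line, and 1.0 diverges. The right value depends on the scale of the features, so it has to be tuned per dataset.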



Answer 2:

As explained by zero323 here, setting the intercept to true will solve the problem. If it is not set to true, the regression line is forced to go through the origin, which is not appropriate in this case. (I'm not sure why this is not included in the sample code.)

So, to fix your problem, change the following line in your code (PySpark):

model = LinearRegressionWithSGD.train(parsedData, numIterations) 

to

model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True) 

Although not mentioned explicitly, this is also why the code from 'selvinsource' in the answer above works. Changing the step size doesn't help much in this example.
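To see how a missing intercept can dominate the error, here is a minimal sketch in plain Python (no Spark). It uses closed-form least squares on made-up data with one centered feature and labels that are not centered. That data shape is an assumption chosen for illustration, and `fit_through_origin` and `fit_with_intercept` are hypothetical helper names.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y = w * x (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """Least-squares slope and intercept for y = w * x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    w = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    return w, y_mean - w * x_mean


def mse(xs, ys, w, b=0.0):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, made up for illustration: centered feature, labels y = 0.5x + 3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0, 2.5, 3.0, 3.5, 4.0]

w0 = fit_through_origin(xs, ys)
w1, b1 = fit_with_intercept(xs, ys)
print("no intercept:   w =", w0, "MSE =", mse(xs, ys, w0))
print("with intercept: w =", w1, "b =", b1, "MSE =", mse(xs, ys, w1, b1))
```

Both fits recover the same slope of 0.5, but without an intercept every prediction is off by the label mean of 3, so the MSE is 9.0 instead of 0.0. No choice of step size can remove that offset, because the model has no parameter that represents it.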


