I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here).
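Throughout this answer I assume parsedData is the RDD of LabeledPoints from the documentation example, built roughly like this (assuming an existing SparkContext sc and the sample data file shipped with Spark):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # rows in lpsa.data look like "label,f1 f2 f3 ..."
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)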
For starters, you're missing an intercept. While the mean values of the independent variables are close to zero:
parsedData.map(lambda lp: lp.features).mean()
## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
## -0.0294, 0.0669])
the mean of the dependent variable is pretty far from it:
parsedData.map(lambda lp: lp.label).mean()
## 2.452345085074627
Forcing the regression line to go through the origin in a case like this doesn't make sense. So let's see how LinearRegressionWithSGD
performs with the default arguments and an added intercept:
model = LinearRegressionWithSGD.train(parsedData, intercept=True)
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
## 0.44005904185432504
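Since the feature means are roughly zero, the fitted intercept should end up close to the label mean computed above (~2.45). You can check directly, though the exact values will vary from run to run:

model.intercept  # should land near the label mean
model.weights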
Let's compare it to the analytical solution:
import numpy as np
from sklearn import linear_model
features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
labels = np.array(parsedData.map(lambda lp: lp.label).collect())
lm = linear_model.LinearRegression()
lm.fit(features, labels)
np.mean((lm.predict(features) - labels) ** 2)
## 0.43919976805833411
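To see how close the two fits really are, you can also put the estimated parameters side by side:

model.intercept, model.weights  # SGD estimates from above
lm.intercept_, lm.coef_         # exact least-squares estimates from sklearn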
As you can see, the results obtained using LinearRegressionWithSGD
are almost optimal.
You could add a grid search, but in this particular case there is probably nothing to gain.
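If you wanted to try anyway, a minimal sketch could look like the following; the step and iteration grids here are arbitrary choices, and for a real search you would score on held-out data rather than the training set:

from itertools import product

def train_mse(step, n):
    # train with the given hyperparameters and return the training MSE
    m = LinearRegressionWithSGD.train(parsedData, iterations=n, step=step, intercept=True)
    return parsedData.map(lambda p: (p.label - m.predict(p.features)) ** 2).mean()

results = [(step, n, train_mse(step, n))
           for step, n in product([0.01, 0.1, 1.0], [100, 500])]
best = min(results, key=lambda t: t[2])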