I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here).
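Throughout this answer I assume parsedData is the RDD of LabeledPoints from the documentation example, built roughly like this (assuming an existing SparkContext sc and the sample data file shipped with Spark):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # rows in lpsa.data look like "label,f1 f2 f3 ..."
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)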
For starters, you're missing an intercept. While the mean values of the independent variables are close to zero:
parsedData.map(lambda lp: lp.features).mean()
## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
## -0.0294, 0.0669])
the mean of the dependent variable is pretty far from it:
parsedData.map(lambda lp: lp.label).mean()
## 2.452345085074627
Forcing the regression line to go through the origin in a case like this doesn't make sense. So let's see how LinearRegressionWithSGD
performs with the default arguments and an added intercept:
model = LinearRegressionWithSGD.train(parsedData, intercept=True)
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
## 0.44005904185432504
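Since the feature means are roughly zero, the fitted intercept should end up close to the label mean computed above (~2.45). You can check directly, though the exact values will vary from run to run:

model.intercept  # should land near the label mean
model.weights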
Let's compare it to the analytical solution:
import numpy as np
from sklearn import linear_model
features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
labels = np.array(parsedData.map(lambda lp: lp.label).collect())
lm = linear_model.LinearRegression()
lm.fit(features, labels)
np.mean((lm.predict(features) - labels) ** 2)
## 0.43919976805833411
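To see how close the two fits really are, you can also put the estimated parameters side by side:

model.intercept, model.weights  # SGD estimates from above
lm.intercept_, lm.coef_         # exact least-squares estimates from sklearn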
As you can see, the results obtained using LinearRegressionWithSGD
are almost optimal.
You could add a grid search, but in this particular case there is probably nothing to gain.
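If you wanted to try anyway, a minimal sketch could look like the following; the step and iteration grids here are arbitrary choices, and for a real search you would score on held-out data rather than the training set:

from itertools import product

def train_mse(step, n):
    # train with the given hyperparameters and return the training MSE
    m = LinearRegressionWithSGD.train(parsedData, iterations=n, step=step, intercept=True)
    return parsedData.map(lambda p: (p.label - m.predict(p.features)) ** 2).mean()

results = [(step, n, train_mse(step, n))
           for step, n in product([0.01, 0.1, 1.0], [100, 500])]
best = min(results, key=lambda t: t[2])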