pyspark Linear Regression Example from official documentation - Bad results?

一向 2020-12-22 01:07

I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here).

I also found thi

1 Answer
  • 2020-12-22 01:31

    For starters, you're missing an intercept. While the mean values of the independent variables are close to zero:

    parsedData.map(lambda lp: lp.features).mean()
    ## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
    ##     -0.0294, 0.0669])
    

    the mean of the dependent variable is pretty far from zero:

    parsedData.map(lambda lp: lp.label).mean()
    ## 2.452345085074627
    

    Forcing the regression line to go through the origin in a case like this doesn't make sense. So let's see how LinearRegressionWithSGD performs with default arguments and an added intercept:

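    from pyspark.mllib.regression import LinearRegressionWithSGD

    # parsedData is the RDD of LabeledPoint built in the documentation example
    model = LinearRegressionWithSGD.train(parsedData, intercept=True)
    valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
    # Mean squared error on the training data
    valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()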
    ## 0.44005904185432504
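
To see why the intercept matters so much here, the following is a minimal NumPy sketch on made-up data (not the dataset from the question): features centered near zero and labels with a mean around 2.5. The helper `fit_mse` is hypothetical and uses ordinary least squares rather than SGD, so it only illustrates the effect of forcing the fit through the origin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption, not the dataset from the question):
# features centered near zero, label mean far from zero.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + 2.5 + rng.normal(scale=0.1, size=200)

def fit_mse(X, y, intercept):
    """Least-squares fit; returns the training mean squared error."""
    # Append a column of ones when an intercept is requested.
    A = np.hstack([X, np.ones((len(X), 1))]) if intercept else X
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

mse_origin = fit_mse(X, y, intercept=False)     # line forced through the origin
mse_intercept = fit_mse(X, y, intercept=True)   # line free to shift vertically
print(mse_origin, mse_intercept)
```

Without the intercept the model cannot absorb the offset in the labels, so the error stays close to the squared label mean regardless of how well the weights are tuned.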
    

    Let's compare it to the analytical solution:

    import numpy as np
    from sklearn import linear_model
    
    features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
    labels = np.array(parsedData.map(lambda lp: lp.label).collect())
    
    lm = linear_model.LinearRegression()
    lm.fit(features, labels)
    np.mean((lm.predict(features) - labels) ** 2)
    ## 0.43919976805833411
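
For reference, "analytical solution" here means the closed-form least-squares fit. A rough sketch of what that computes, on made-up data and using only NumPy (the variable names are illustrative, and this is not how scikit-learn implements it internally), is to solve the normal equations directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data for illustration only.
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 3.0 + rng.normal(scale=0.2, size=100)

# Design matrix with a trailing column of ones for the intercept.
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: (A^T A) w = A^T y.
w = np.linalg.solve(A.T @ A, A.T @ y)

# The gradient of the mean squared error vanishes at this solution,
# so no iterative method can reach a lower training MSE.
gradient = 2 * A.T @ (A @ w - y) / len(y)
mse = float(np.mean((A @ w - y) ** 2))
print(w, mse)
```

This is why the scikit-learn number above serves as a lower bound for the training MSE that any SGD run can achieve on the same data.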
    

    As you can see, the results obtained using LinearRegressionWithSGD are almost optimal.

    You could add a grid search, but in this particular case there is probably nothing to gain.
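
If you did want to try a grid search, it would amount to looping over candidate values of parameters such as `step` and `iterations` and keeping the combination with the lowest error. The sketch below is only a stand-in: it uses a hypothetical `gd_mse` helper that runs plain full-batch gradient descent in NumPy on made-up data, not LinearRegressionWithSGD itself, and the grid values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data for illustration only.
X = rng.normal(size=(300, 3))
y = X @ np.array([0.7, -0.4, 0.1]) + 2.5 + rng.normal(scale=0.5, size=300)
A = np.hstack([X, np.ones((len(X), 1))])  # trailing ones column = intercept

def gd_mse(step, iterations):
    """Full-batch gradient descent on the MSE; returns the final training MSE."""
    w = np.zeros(A.shape[1])
    for _ in range(iterations):
        w -= step * 2 * A.T @ (A @ w - y) / len(y)
    return float(np.mean((A @ w - y) ** 2))

# Evaluate every (step, iterations) pair in a small, arbitrary grid.
grid = [(s, n) for s in (0.01, 0.1, 0.3) for n in (50, 200)]
results = {params: gd_mse(*params) for params in grid}
best_params = min(results, key=results.get)

# Closed-form optimum: the lower bound no grid point can beat.
w_opt, *_ = np.linalg.lstsq(A, y, rcond=None)
optimal_mse = float(np.mean((A @ w_opt - y) ** 2))
print(best_params, results[best_params], optimal_mse)
```

The best grid point can only approach the closed-form error, never beat it, which is the sense in which a model that is already almost optimal leaves nothing to gain.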
