How to correctly get the weights using Spark for a synthetic dataset?

Submitted by 烈酒焚心 on 2020-01-07 03:14:12

Question


I'm running LogisticRegressionWithSGD on Spark on a synthetic dataset. I've computed the error with vanilla gradient descent in MATLAB and in R, and it is ~5%; the weights I recovered there are close to the ones used in the model that generated y. The dataset was generated using this example.

While I can get a very similar error rate in the end by tuning the step size, the weights for the individual features aren't the same. In fact, they vary a lot. I tried LBFGS on Spark and it predicts both the error and the weights correctly within a few iterations. My problem is specifically with logistic regression with SGD on Spark. A minimal sketch of the comparison is below.
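For reference, this is roughly how the two trainers can be compared, assuming the RDD-based MLlib API (LogisticRegressionWithSGD / LogisticRegressionWithLBFGS, deprecated since Spark 2.0). The data generation here is only a hypothetical stand-in for the linked example, drawing labels from a logistic model with the true weights and intercept quoted further down; names and parameter values are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import scala.util.Random

    val sc = new SparkContext(new SparkConf().setAppName("sgd-vs-lbfgs").setMaster("local[*]"))

    // Hypothetical stand-in for the linked generator: features ~ N(0, 1),
    // labels sampled from a logistic model with the "true" weights/intercept.
    val trueWeights = Array(2.0, 3.0, 4.0, 2.0, 1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 2.0, 3.0, 4.0, 2.0, 1.0)
    val trueIntercept = 1.0
    val rng = new Random(42)
    val data = sc.parallelize((1 to 10000).map { _ =>
      val x = Array.fill(trueWeights.length)(rng.nextGaussian())
      val margin = trueIntercept + trueWeights.zip(x).map { case (w, v) => w * v }.sum
      val label = if (rng.nextDouble() < 1.0 / (1.0 + math.exp(-margin))) 1.0 else 0.0
      LabeledPoint(label, Vectors.dense(x))
    }).cache()

    // SGD-based trainer (the one this question is about).
    val sgd = new LogisticRegressionWithSGD()
    sgd.setIntercept(true)
    sgd.optimizer.setNumIterations(1000).setStepSize(1.0).setMiniBatchFraction(1.0)
    val sgdModel = sgd.run(data)

    // L-BFGS-based trainer, which recovers the weights in far fewer iterations.
    val lbfgsModel = new LogisticRegressionWithLBFGS().setIntercept(true).run(data)

    println(s"SGD   weights: ${sgdModel.weights}  intercept: ${sgdModel.intercept}")
    println(s"LBFGS weights: ${lbfgsModel.weights}  intercept: ${lbfgsModel.intercept}")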

The weights I'm getting:

[0.466521045342,0.699614292387,0.932673108363,0.464446310304,0.231458578991,0.464372487994,0.700369689073,0.928407671516,0.467131704168,0.231629845549,0.46465456877,0.700207596219,0.935570594833,0.465697758292,0.230127949916]

The weights I want:

[2,3,4,2,1,2,3,4,2,1,2,3,4,2,1]

The intercept I'm getting: 0.2638102010832128

The intercept I want: 1

Q.1. Is this a problem with the synthetic dataset? I have tried tuning miniBatchFraction, stepSize, the number of iterations, and the intercept, but I couldn't get it right.

Q.2. Why is Spark giving me these weird weights? Is it wrong to expect similar weights from Spark's model?

Please let me know if extra details are needed to answer my question.


Answer 1:


It actually did converge: your weights are normalized between 0 and 1, while the expected maximum value is 4. Multiply everything you got from SGD by 4 and you can see the correlation, even for the intercept value.
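For example, a short sketch (reusing the hypothetical sgdModel from the snippet above) that rescales the SGD output by the expected maximum weight of 4 and compares it against the true weights:

    // Rescale the SGD solution by the expected maximum weight (4) and compare
    // it against the true weights quoted in the question.
    val expected = Array(2.0, 3.0, 4.0, 2.0, 1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 2.0, 3.0, 4.0, 2.0, 1.0)
    val rescaled = sgdModel.weights.toArray.map(_ * 4.0)
    expected.zip(rescaled).foreach { case (want, got) =>
      println(f"expected: $want%.1f   rescaled SGD: $got%.3f")
    }
    println(f"rescaled intercept: ${sgdModel.intercept * 4.0}%.3f   (expected: 1.0)")

On the numbers quoted in the question this lines up reasonably well, e.g. 0.4665 * 4 ≈ 1.87 against an expected 2, and 0.2638 * 4 ≈ 1.06 against an expected intercept of 1.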



Source: https://stackoverflow.com/questions/44377796/how-to-correctly-get-the-weights-using-spark-for-synthetic-dataset
