Error with training logistic regression model on Apache Spark. SPARK-5063

Submitted by て烟熏妆下的殇ゞ on 2019-12-22 18:30:43

Question


I am trying to build a Logistic Regression model with Apache Spark. Here is the code.

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

parsedData = raw_data.map(mapper) # mapper is a function that generates a pair of label and feature vector as a LabeledPoint object
featureVectors = parsedData.map(lambda point: point.features) # get feature vectors from the parsed data
scaler = StandardScaler(True, True).fit(featureVectors) # this fits a standardization model to scale the features
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features))) # transform the features to zero mean and unit std deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations = 10)

But I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I am not sure how to work around this. Any help would be greatly appreciated.


Answer 1:


The problem you see is pretty much the same as the one I've described in How to use Java/Scala function from an action or a transformation? To transform, you have to call a Scala function, and it requires access to the SparkContext, hence the error you see.

The standard way to handle this is to process only the required part of your data and then zip the results together.

labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)

If you don't plan to implement your own methods based on MLlib components, it could be easier to use the high-level ML API.

Edit:

There are two possible problems here.

  1. At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass classification, you can replace it with LogisticRegressionWithLBFGS.
  2. StandardScaler supports only dense vectors, so it has limited applications.


来源:https://stackoverflow.com/questions/32196339/error-with-training-logistic-regression-model-on-apache-spark-spark-5063
