Question
I am trying to build a Logistic Regression model with Apache Spark. Here is the code:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.classification import LogisticRegressionWithSGD

# mapper generates a (label, feature vector) pair as a LabeledPoint object
parsedData = raw_data.map(mapper)
# extract the feature vectors from the parsed data
featureVectors = parsedData.map(lambda point: point.features)
# fit a standardization model to scale the features
scaler = StandardScaler(True, True).fit(featureVectors)
# transform the features to zero mean and unit standard deviation
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)
But I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I am not sure how to work around this. Any help would be greatly appreciated.
Answer 1:
The problem you see is pretty much the same as the one I've described in How to use Java/Scala function from an action or a transformation? To transform, you have to call a Scala function, and that requires access to the SparkContext, hence the error you see.

The standard way to handle this is to process only the required part of your data and then zip the results:
labels = parsedData.map(lambda point: point.label)
featuresTransformed = scaler.transform(featureVectors)

scaledData = (labels
    .zip(featuresTransformed)
    .map(lambda p: LabeledPoint(p[0], p[1])))

modelScaledSGD = LogisticRegressionWithSGD.train(...)
If you don't plan to implement your own methods based on MLlib components, it could be easier to use the high-level ML API.
Edit:

There are two possible problems here:

- At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass classification you can replace it with LogisticRegressionWithLBFGS.
- StandardScaler supports only dense vectors, so it has limited applications.
Source: https://stackoverflow.com/questions/32196339/error-with-training-logistic-regression-model-on-apache-spark-spark-5063