Issues with Logistic Regression for multiclass classification using PySpark

[愿得一人]

I am trying to use Logistic Regression to classify a dataset whose feature vectors are stored as SparseVector:

For full code base and error log, p

1 Answer

鱼传尺愫

2021-01-03 03:33
Case 1: There is nothing strange here, simply (as the error message says) LogisticRegression does not support multi-class classification, as clearly stated in the documentation.

Case 2: Here you have switched from ML to MLlib, which does not work with dataframes; it needs its input as an RDD of LabeledPoint (see the MLlib documentation), so again the error message is expected.

Case 3: Here is where things get interesting. First, you should remove the brackets from your map function, i.e. it should be
```
trainingData = trainingData.map(lambda row: LabeledPoint(row.label, row.features)) # no brackets after "row:"
```
Nevertheless, guessing from the code snippets you have provided, most probably you are going to get a different error now:
```
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
[...]
: org.apache.spark.SparkException: Input validation failed.
```
Here is what is happening (it took me some time to figure out), using some dummy data (it's always a good idea to provide sample data with your question):
```
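# 3-class classification (assumes an existing SparkContext named sc)
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([
     LabeledPoint(3.0, SparseVector(100,[10, 98],[1.0, 1.0])),
     LabeledPoint(1.0, SparseVector(100,[1, 22],[1.0, 1.0])),
     LabeledPoint(2.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])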

lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # throws exception
[...]
: org.apache.spark.SparkException: Input validation failed.
```
The problem is that your labels must start from 0, i.e. for numClasses=k they must be the values 0, 1, ..., k-1 (this is documented nowhere; you have to dig into the Scala source code to see that this is the case!). So, mapping the labels in my dummy data above from (1.0, 2.0, 3.0) to (0.0, 1.0, 2.0), we finally get:
```
# 3-class classification
data = sc.parallelize([
     LabeledPoint(2.0, SparseVector(100,[10, 98],[1.0, 1.0])),
     LabeledPoint(0.0, SparseVector(100,[1, 22],[1.0, 1.0])),
     LabeledPoint(1.0, SparseVector(100,[36, 54],[1.0, 1.0]))
])

lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # no error now
```
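To see why the first attempt fails and the second does not, the label check can be sketched in plain Python. This is only an approximation of what the Scala-side validator appears to do (whole-number labels between 0 and numClasses - 1); the helper name `find_invalid_labels` is hypothetical and not part of PySpark.

```python
def find_invalid_labels(labels, num_classes):
    """Return the sorted distinct labels that are not whole numbers
    in the range 0 .. num_classes - 1 (assumed validation rule)."""
    invalid = set()
    for label in labels:
        value = float(label)
        if not value.is_integer() or value < 0 or value > num_classes - 1:
            invalid.add(value)
    return sorted(invalid)


# Labels from the two dummy datasets above
first_attempt = [3.0, 1.0, 2.0]   # the run that threw the exception
second_attempt = [2.0, 0.0, 1.0]  # the run that trained without error

bad_first = find_invalid_labels(first_attempt, 3)
bad_second = find_invalid_labels(second_attempt, 3)
print(bad_first, bad_second)
```

On a real RDD of LabeledPoint, the labels to check could be gathered with something like `data.map(lambda lp: lp.label).distinct().collect()` before calling the helper.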
Judging from your numClasses=5 argument, as well as from the label=5.0 in one of your printed records, I guess that most probably your code suffers from the same issue. Remap your labels to the range 0.0 through 4.0 (i.e. 0.0, 1.0, 2.0, 3.0, 4.0) and you should be fine.
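A minimal sketch of such a remapping, in plain Python so it can be tried without a cluster. The sample labels 1.0 through 5.0 are an assumption about your data (only label=5.0 and numClasses=5 are visible in the question), and `build_label_index` is a hypothetical helper, not a PySpark function.

```python
def build_label_index(labels):
    """Map each distinct label to a 0-based float index, in sorted order."""
    distinct = sorted(set(float(label) for label in labels))
    return {label: float(i) for i, label in enumerate(distinct)}


# Assumed original labels: 1.0 .. 5.0 (guess based on numClasses=5)
original_labels = [5.0, 1.0, 3.0, 2.0, 4.0, 1.0]

label_index = build_label_index(original_labels)
remapped = [label_index[label] for label in original_labels]
print(remapped)

# With an RDD of LabeledPoint, the same dictionary could be applied as:
# trainingData = trainingData.map(
#     lambda lp: LabeledPoint(label_index[lp.label], lp.features))
```

Keep the dictionary around after training, since predictions come back as the remapped indices and need to be translated to the original labels.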

(I suggest that you delete the other identical question you have opened here, for reducing clutter...)