Issues with Logistic Regression for multiclass classification using PySpark

[愿得一人] 2021-01-03 03:18

I am trying to use Logistic Regression to classify a dataset whose feature column contains sparse vectors:

For full code base and error log, p

1 answer
  • 2021-01-03 03:33

    Case 1: There is nothing strange here, simply (as the error message says) LogisticRegression does not support multi-class classification, as clearly stated in the documentation.

    Case 2: Here you have switched from ML to MLlib, which does not work with dataframes but needs its input as an RDD of LabeledPoint (documentation), so again the error message is expected.

    Case 3: Here is where things get interesting. First, you should remove the brackets from your map function, i.e. it should be

    trainingData = trainingData.map(lambda row: LabeledPoint(row.label, row.features)) # no brackets after "row:"
    

    Nevertheless, guessing from the code snippets you have provided, most probably you are going to get a different error now:

    model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=5)
    [...]
    : org.apache.spark.SparkException: Input validation failed.
    

    Here is what is happening (it took me some time to figure it out), using some dummy data (it's always a good idea to provide some sample data with your question):

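    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.linalg import SparseVector
    from pyspark.mllib.regression import LabeledPoint

    # 3-class classification (assumes an existing SparkContext `sc`)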
    data = sc.parallelize([
         LabeledPoint(3.0, SparseVector(100,[10, 98],[1.0, 1.0])),
         LabeledPoint(1.0, SparseVector(100,[1, 22],[1.0, 1.0])),
         LabeledPoint(2.0, SparseVector(100,[36, 54],[1.0, 1.0]))
    ])
    
    lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # throws exception
    [...]
    : org.apache.spark.SparkException: Input validation failed.
    

    The problem is that your labels must be 0-based, i.e. take the values 0, 1, ..., numClasses - 1 (and this is nowhere documented - you have to dig in the Scala source code to see that this is the case!); so, mapping the labels in my dummy data above from (1.0, 2.0, 3.0) to (0.0, 1.0, 2.0), we finally get:

    # 3-class classification
    data = sc.parallelize([
         LabeledPoint(2.0, SparseVector(100,[10, 98],[1.0, 1.0])),
         LabeledPoint(0.0, SparseVector(100,[1, 22],[1.0, 1.0])),
         LabeledPoint(1.0, SparseVector(100,[36, 54],[1.0, 1.0]))
    ])
    
    lrm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3) # no error now
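
    The label constraint can be checked before training. The following is only a minimal sketch in plain Python: the helper name invalid_labels is hypothetical, and the rule it encodes (labels must be integer-valued and lie between 0 and numClasses - 1) is my reading of the Scala validation code rather than documented behaviour.

```python
def invalid_labels(labels, num_classes):
    """Return the sorted distinct labels that are not integer-valued
    or fall outside the range 0 .. num_classes - 1."""
    return sorted({
        label for label in labels
        if label != int(label) or not 0 <= label <= num_classes - 1
    })


# Labels from the two dummy datasets above
print(invalid_labels([3.0, 1.0, 2.0], num_classes=3))  # [3.0]
print(invalid_labels([2.0, 0.0, 1.0], num_classes=3))  # []

# With PySpark, the distinct labels could be collected from the RDD first:
# labels = data.map(lambda lp: lp.label).distinct().collect()
```

    An empty list means that every label passes this check.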
    

    Judging from your numClasses=5 argument, as well as from the label=5.0 in one of your printed records, I guess that your code most probably suffers from the same issue. Remap your labels to the range 0.0 to 4.0 (i.e. 0.0, 1.0, 2.0, 3.0, 4.0) and you should be fine.
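
    One way to do the remapping, sketched here under the assumption that your labels are 1.0 to 5.0 (I have only seen label=5.0 in your output); the helper build_label_index is a hypothetical name, not part of Spark:

```python
def build_label_index(labels):
    """Map each distinct label to a 0-based float index, in sorted order."""
    return {label: float(i) for i, label in enumerate(sorted(set(labels)))}


# Assumed original labels 1.0 .. 5.0
label_index = build_label_index([1.0, 2.0, 3.0, 4.0, 5.0])
print(label_index[1.0])  # 0.0
print(label_index[5.0])  # 4.0

# With PySpark, assuming trainingData is already an RDD of LabeledPoint:
# trainingData = trainingData.map(
#     lambda lp: LabeledPoint(label_index[lp.label], lp.features))
```

    Keep the dictionary around, since you need the inverse mapping to translate the model's predictions back to your original labels.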

    (I suggest that you delete the other identical question you have opened here, for reducing clutter...)
