What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

余生分开走 2021-02-05 10:09

After training a LogisticRegressionModel, I transformed the test DataFrame with it and got the prediction DataFrame. When I call prediction.show(), the output column names are:

3 answers
  •  情深已故
    2021-02-05 10:56

    Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563

    rawPrediction is a per-label confidence score, not necessarily a probability, and its scale depends on the algorithm. From the Spark docs:

    Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

    The prediction is the label whose rawPrediction entry is largest, found via `argmax`:

      protected def raw2prediction(rawPrediction: Vector): Double =
              rawPrediction.argmax
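
    As a small illustration of the argmax step, the sketch below uses a plain Array[Double] in place of Spark's Vector. The object name ArgmaxSketch and the sample values are made up for illustration and are not part of Spark.

    ```scala
    // Illustrative only: a plain Array[Double] stands in for Spark's Vector.
    object ArgmaxSketch {
      // Index of the largest entry; the first index wins on ties.
      def argmax(raw: Array[Double]): Int = raw.indexOf(raw.max)

      // Same shape as raw2prediction: the predicted label is the winning index as a Double.
      def raw2prediction(raw: Array[Double]): Double = argmax(raw).toDouble
    }
    ```

    For a raw prediction of (-1.2, 1.2), ArgmaxSketch.raw2prediction returns 1.0, i.e. the second label.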
    

    The Probability is the conditional probability for each class. Here is the scaladoc:

    Estimate the probability of each class given the raw prediction,
    doing the computation in-place. These predictions are also called class conditional probabilities.

    The actual calculation depends on which Classifier you are using.

    DecisionTree

    Normalize a vector of raw predictions to be a multinomial probability vector, in place.

    It sums the instance counts per class and divides each count by the total instance count:

     class_k probability = Count_k/Count_Total
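
    The count-based normalization can be sketched in plain Scala as follows. This assumes the raw prediction holds one instance count per class; the object name CountNormalizeSketch is hypothetical, and the zero-total guard is a choice made for this sketch rather than a statement about Spark's behavior.

    ```scala
    // Illustrative only: per-class counts are passed as a plain Array[Double].
    object CountNormalizeSketch {
      // Divide each class count by the total so the entries sum to 1.
      def normalize(counts: Array[Double]): Array[Double] = {
        val total = counts.sum
        // Sketch-level guard: with no instances, return the counts unchanged.
        if (total == 0.0) counts.clone() else counts.map(_ / total)
      }
    }
    ```

    With counts (3.0, 1.0), the probabilities come out as (0.75, 0.25).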
    

    LogisticRegression

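    For binary classification it applies the logistic (sigmoid) function to each raw score; multinomial models use a softmax over the raw scores instead.

     class_k probability = 1/(1 + exp(-rawPrediction_k))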
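
    A minimal sketch of the binary case, assuming the raw prediction has the form (-margin, margin) so that the two resulting probabilities sum to 1. The object name LogisticSketch is hypothetical and the code does not call Spark.

    ```scala
    // Illustrative only: binary logistic regression with raw scores in a plain array.
    object LogisticSketch {
      // Logistic (sigmoid) function.
      def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

      // Apply the sigmoid to every raw score.
      // Assumption: raw is (-margin, margin), as in the binary case.
      def raw2probability(raw: Array[Double]): Array[Double] = raw.map(sigmoid)
    }
    ```

    A margin of 0 gives (0.5, 0.5), and a larger positive margin pushes the probability of the second label toward 1.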
    

    Naive Bayes

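    The raw predictions are log-scale scores. The maximum is subtracted before exponentiating to avoid underflow, and the result is normalized to sum to 1:

     class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))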
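
    Turning log-scale scores into probabilities is usually done by subtracting the largest score before exponentiating (to avoid underflow) and then normalizing so the entries sum to 1. The sketch below shows that pattern in plain Scala; the object name LogNormalizeSketch is hypothetical, and this is an illustration rather than Spark's source.

    ```scala
    // Illustrative only: log-scale raw scores in a plain Array[Double].
    object LogNormalizeSketch {
      def raw2probability(raw: Array[Double]): Array[Double] = {
        val maxLog = raw.max
        // Shift by the max so the largest exponent is exp(0) = 1.
        val shifted = raw.map(r => math.exp(r - maxLog))
        val total = shifted.sum
        // Normalize so the probabilities sum to 1.
        shifted.map(_ / total)
      }
    }
    ```

    Without the shift, scores such as (-1000.0, -1000.0) would both exponentiate to 0; with it, the sketch returns (0.5, 0.5).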
    

    Random Forest

     class_k probability = Count_k/Count_Total
    
