What do the columns 'rawPrediction' and 'probability' of a DataFrame mean in Spark MLlib?


余生分开走 2021-02-05 10:09

After training a LogisticRegressionModel, I transformed the test DataFrame with it and got the prediction DataFrame. When I call prediction.show(), the output column names are:

3 answers

情深已故 (original poster)

2021-02-05 10:56
Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563

`rawPrediction` is typically the model's direct confidence score for each label, before any conversion into probabilities. From the Spark docs:

Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

The `prediction` is the label with the largest `rawPrediction` value, found via `argmax`:
```
protected def raw2prediction(rawPrediction: Vector): Double =
  rawPrediction.argmax
```
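To make the `argmax` step concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object name `ArgmaxSketch` and the sample scores are made up for illustration; Spark applies the same idea to its own `Vector` type.

```scala
object ArgmaxSketch {
  // Index of the largest raw score, i.e. the predicted label index.
  def argmax(raw: Array[Double]): Int =
    raw.indices.maxBy(i => raw(i))

  def main(args: Array[String]): Unit = {
    // Hypothetical raw scores for labels 0 and 1.
    val rawPrediction = Array(-1.3, 1.3)
    println(argmax(rawPrediction).toDouble) // prints 1.0
  }
}
```

The result is converted to a `Double` because the prediction column stores labels as doubles.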
The Probability is the conditional probability for each class. Here is the scaladoc:

Estimate the probability of each class given the raw prediction,
doing the computation in-place. These predictions are also called class conditional probabilities.

The actual calculation depends on which Classifier you are using.

DecisionTree

Normalize a vector of raw predictions to be a multinomial probability vector, in place.

It simply sums by class across the instances and then divides by the total instance count.
```
 class_k probability = Count_k/Count_Total
```
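As a rough illustration of that normalization, the sketch below assumes the raw prediction of a tree holds the per-class instance counts of the leaf a row falls into. `NormalizeCountsSketch` and the counts are hypothetical, and the code is plain Scala rather than Spark's in-place implementation.

```scala
object NormalizeCountsSketch {
  // Divide each class count by the total count.
  // An all-zero vector is returned unchanged to avoid dividing by zero.
  def normalize(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    if (total == 0.0) counts.clone() else counts.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical leaf: 8 instances of class 0, 2 instances of class 1.
    val leafCounts = Array(8.0, 2.0)
    println(normalize(leafCounts).mkString("[", ", ", "]")) // prints [0.8, 0.2]
  }
}
```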
LogisticRegression

It uses the logistic formula
```
 class_k probability = 1/(1 + exp(-rawPrediction_k))
```
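For the binary case, a small sketch of this formula follows. It assumes the raw prediction has the form `[-margin, margin]`, where `margin` is the linear score of the model; that layout, the name `LogisticSketch`, and the sample value are assumptions for illustration. Under that assumption the two probabilities sum to 1, because `1/(1 + exp(m)) = 1 - 1/(1 + exp(-m))`.

```scala
object LogisticSketch {
  // Logistic (sigmoid) function.
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Apply the logistic function to each raw score independently.
  def rawToProbability(raw: Array[Double]): Array[Double] = raw.map(sigmoid)

  def main(args: Array[String]): Unit = {
    val margin = 1.3                  // hypothetical linear score w.x + b
    val raw = Array(-margin, margin)  // assumed layout of the raw prediction
    val prob = rawToProbability(raw)
    println(prob.mkString("[", ", ", "]"))
    println(prob.sum)                 // approximately 1.0
  }
}
```

With more than two classes the per-class logistic formula no longer sums to 1, so a multiclass model would need a different normalization.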
Naive Bayes
```
 class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))
```
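The sketch below shows one common way to turn log-scale scores into probabilities, which is my reading of what the Naive Bayes model does: subtract the largest raw score before exponentiating (for numerical stability), then divide by the sum so the values add up to 1. `LogScoreNormalizeSketch` and the sample scores are hypothetical, and the code is plain Scala, not Spark's implementation.

```scala
object LogScoreNormalizeSketch {
  // Convert log-scale raw scores into probabilities that sum to 1.
  def toProbability(raw: Array[Double]): Array[Double] = {
    val maxLog = raw.max
    // Shifting by the max keeps every exponent <= 0, so exp cannot overflow.
    val shifted = raw.map(r => math.exp(r - maxLog))
    val sum = shifted.sum
    shifted.map(_ / sum)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log-scale scores for three classes.
    val raw = Array(-10.0, -12.0, -11.0)
    println(toProbability(raw).mkString("[", ", ", "]"))
  }
}
```

Shifting by the maximum does not change the result, since the same factor cancels in the numerator and the denominator.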
Random Forest
```
 class_k probability = Count_k/Count_Total
```