After I trained a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I then call prediction.show(), the output column names are:
In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:
The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
It is not there in the later versions, but you can still find it in the Scala source code.
Anyway, and any unfortunate wording aside, rawPrediction in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
Here is an example with toy data in PySpark:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
(0.0, Vectors.dense(0.0, 1.0)),
(1.0, Vectors.dense(1.0, 0.0))],
["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# | 0.0|[0.0,1.0]|
# | 1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)
Here is the result:
+---------+----------------------------------------+----------------------------------------+----------+
|features | rawPrediction | probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]| 0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613] | 1.0 |
+---------+----------------------------------------+----------------------------------------+----------+
Let's now confirm that the logistic function of rawPrediction gives the probability column:
import numpy as np
x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])
x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])
i.e. this is indeed the case.
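The prediction column can be checked in the same way. Below is a minimal sketch in plain Python that reuses the rawPrediction values printed above; the helper names `logistic` and `argmax` are introduced here for illustration only and are not part of the Spark API.

```python
import math

def logistic(x):
    # Logistic function exp(x) / (1 + exp(x)), as used in the check above
    return math.exp(x) / (1.0 + math.exp(x))

def argmax(values):
    # Index of the largest element of a list
    return max(range(len(values)), key=lambda i: values[i])

# rawPrediction values copied from the lr_result output above
raw_rows = [
    [0.9894187891647654, -0.9894187891647654],
    [-0.9894187891647683, 0.9894187891647683],
]

# probability: logistic function applied element-wise to rawPrediction
probabilities = [[logistic(x) for x in row] for row in raw_rows]

# prediction: index of the largest probability, reported as a float label
predictions = [float(argmax(p)) for p in probabilities]
print(predictions)  # [0.0, 1.0]
```

The resulting labels agree with the prediction column shown in the result table.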
So, to summarize regarding all three (3) output columns:
rawPrediction is the raw output of the logistic regression classifier (array with length equal to the number of classes)
probability is the result of applying the logistic function to rawPrediction (array of length equal to that of rawPrediction)
prediction is the argument where the array probability takes its maximum value, and it gives the most probable label (single number)
Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563
RawPrediction is typically the direct probability/confidence calculation. From Spark docs:
Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
The Prediction is the result of finding the statistical mode of the rawPrediction, via argmax:
protected def raw2prediction(rawPrediction: Vector): Double =
rawPrediction.argmax
The Probability is the conditional probability for each class. Here is the scaladoc:
Estimate the probability of each class given the raw prediction,
doing the computation in-place. These predictions are also called class conditional probabilities.
The actual calculation depends on which Classifier you are using.
DecisionTree
Normalize a vector of raw predictions to be a multinomial probability vector, in place.
It simply sums by class across the instances and then divides by the total instance count.
class_k probability = Count_k/Count_Total
LogisticRegression
It uses the logistic formula
class_k probability: 1/(1 + exp(-rawPrediction_k))
Naive Bayes
class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))
Random Forest
class_k probability = Count_k/Count_Total
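As a rough illustration of two of the per-classifier conversions listed above, here is a minimal sketch in plain Python. The function names and input values are hypothetical. The Naive Bayes variant assumes that the raw scores are log-scale values that get shifted by their maximum, exponentiated, and then normalized to sum to one; that is my reading of the Spark source, not something the scaladoc quoted above states.

```python
import math

def normalize_counts(raw):
    # Tree-based models: raw holds per-class counts; divide each by the total
    total = sum(raw)
    return [count / total for count in raw]

def naive_bayes_probability(raw):
    # Assumption: shift the log-scale scores by their maximum for numerical
    # stability, exponentiate, then normalize so the entries sum to one
    max_log = max(raw)
    exps = [math.exp(r - max_log) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

tree_probs = normalize_counts([3.0, 1.0])            # hypothetical leaf counts
nb_probs = naive_bayes_probability([-10.0, -12.0])   # hypothetical log scores

print(tree_probs)  # [0.75, 0.25]
print(nb_probs)
```

In both cases the output is a vector of non-negative entries that sum to one, which is what the probability column holds.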
If the classification model is logistic regression:
rawPrediction is equal to (w*x + bias), computed from the fitted coefficient values
probability is 1/(1+e^-(w*x + bias))
prediction is 0 or 1.
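For the binary case, the three columns can be reproduced by hand. The sketch below uses made-up coefficients, intercept, and feature values (they are not taken from any fitted model). It also assumes the two-element layout seen in the example output earlier in this thread, where rawPrediction is [-margin, margin], and a decision threshold of 0.5.

```python
import math

w = [2.0, -1.0]   # hypothetical coefficients
bias = 0.5        # hypothetical intercept
x = [0.2, 0.5]    # hypothetical feature vector

# margin = w*x + bias
margin = sum(wi * xi for wi, xi in zip(w, x)) + bias

# rawPrediction: one entry per class, assumed layout [-margin, margin]
raw_prediction = [-margin, margin]

# probability of label 1 via the logistic function; label 0 gets the rest
p1 = 1.0 / (1.0 + math.exp(-margin))
probability = [1.0 - p1, p1]

# prediction: label 1 if its probability exceeds the assumed 0.5 threshold
prediction = 1.0 if p1 > 0.5 else 0.0

print(raw_prediction, probability, prediction)
```

With these values the margin is positive, so label 1 is the more probable class.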