pyspark 2.2.0 concept behind raw predictions field of logistic regression model

问题

I was trying to understand the concept of the output generated from logistic regression model in Pyspark.

Could anyone please explain the concept behind the rawPrediction field calculation generated from a logistic regression model? Thanks.

回答1:

In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:

The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

It is not there in the later versions, but you can still find it in the Scala source code.

Anyway, and any unfortunate wording aside, the rawPrecictions in Spark ML, for the logistic regression case, is what the rest of the world call logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).

Here is an example with toy data:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))], 
     ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)

Here is the result:

+---------+----------------------------------------+----------------------------------------+----------+ 
|features |                          rawPrediction |                            probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+ 
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]|      0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613]  |      1.0 | 
+---------+----------------------------------------+----------------------------------------+----------+

Let's now confirm that the logistic function of rawPrediction gives the probability column:

import numpy as np

x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])

x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])

i.e. this is the case indeed

So, to summarize regarding all three (3) output columns:

rawPrediction is the raw output of the logistic regression classifier (array with length equal to the number of classes)
probability is the result of applying the logistic function to rawPrediction (array of length equal to that of rawPrediction)
prediction is the argument where the array probability takes its maximum value, and it gives the most probable label (single number)

来源：https://stackoverflow.com/questions/48256860/pyspark-2-2-0-concept-behind-raw-predictions-field-of-logistic-regression-model

标签

machine-learning

pyspark

logistic-regression

apache-spark-ml