How to extract variable weight from spark pipeline logistic model?

Submitted by 被刻印的时光 ゝ on 2019-12-10 18:19:26

Question


I am currently trying to learn Spark Pipelines (Spark 1.6.0). I imported the datasets (train and test) as oas.sql.DataFrame objects. After executing the code below, the resulting model is an oas.ml.tuning.CrossValidatorModel.

You can use model.transform(test) to predict on the test data in Spark. However, I would like to compare the weights the model uses for prediction with those from R. How can I extract the weights of the predictors and the intercept (if any) from the model? The Scala code is:

import sqlContext.implicits._
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

val conTrain = sc.textFile("AbsolutePath2Train.txt")
val conTest = sc.textFile("AbsolutePath2Test.txt")

// parse text and convert to sql.DataFrame
val train = conTrain.map { line =>
  val parts = line.split(",")
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()
val test = conTest.map { line =>
  val parts = line.split(",")
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" +").map(_.toDouble)))
}.toDF()

// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)

// fit logistic model
val model = cv.fit(train)

// If you want to predict with test
val pred = model.transform(test)

My Spark environment is not accessible at the moment, so this code was retyped and rechecked; I hope it is correct. So far I have tried searching the web and asking others. Suggestions and criticism of my code are welcome.


Answer 1:


// set parameter space and evaluation method
val lr = new LogisticRegression().setMaxIter(400)
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
// fit the cross-validator; then you can print the lr model coefficients as below
import org.apache.spark.ml.PipelineModel
val model = cv.fit(train)
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
val lrModel = bestPipeline.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"LR Model coefficients:\n${lrModel.coefficients.toArray.mkString("\n")}")

Two steps:

  1. Get the best pipeline from the cross-validation result.
  2. Get the LR model from the best pipeline; it is the first stage in your code example. The intercept and the tuned hyper-parameters can be read from the same lrModel, as in the sketch below.
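
A minimal sketch, assuming the lrModel extracted above, that also prints the intercept and the regularization settings selected by cross-validation:

// intercept of the selected logistic regression model
println(s"Intercept: ${lrModel.intercept}")
// regularization settings chosen during cross-validation
println(s"regParam: ${lrModel.getRegParam}, elasticNetParam: ${lrModel.getElasticNetParam}")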



Answer 2:


I was looking for exactly the same thing. You might already have the answer, but anyway, here it is.

import org.apache.spark.ml.classification.LogisticRegressionModel
// note: this direct cast only works when the CrossValidator's estimator is the LogisticRegression itself;
// with a Pipeline estimator (as in the question), bestModel is a PipelineModel (see Answer 1)
val lrmodel = model.bestModel.asInstanceOf[LogisticRegressionModel]
println(s"coefficients: ${lrmodel.coefficients}, intercept: ${lrmodel.intercept}")
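
If it is unclear whether bestModel is a plain LogisticRegressionModel or a PipelineModel (see the observation in Answer 3 below), a quick look at the runtime type before casting avoids a ClassCastException; a small sketch, assuming "model" is the fitted CrossValidatorModel from the question:

// inspect the concrete runtime type of the best model before casting
println(model.bestModel.getClass.getSimpleName)
// prints "PipelineModel" when the estimator was a Pipeline,
// and "LogisticRegressionModel" when the estimator was the LogisticRegression itself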



Answer 3:


I am still not sure how to extract the weights from "model" above. However, by restructuring the process along the lines of the official tutorial, the following works on spark-1.6.0:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
val lr = new LogisticRegression().setMaxIter(400)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
// note: the estimator here is the LogisticRegression itself, not a Pipeline
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
val restructuredModel = trainValidationSplit.fit(train)
val lrmodel = restructuredModel.bestModel.asInstanceOf[LogisticRegressionModel]
lrmodel.weights    // deprecated in 1.6; lrmodel.coefficients is the newer name
lrmodel.intercept

I noticed the difference between "lrmodel" here and "model" generated above:

model.bestModel --> gives oas.ml.Model[_] = pipeline_****

restructuredModel.bestModel --> gives oas.ml.Model[_] = logreg_****

That is why restructuredModel.bestModel can be cast to LogisticRegressionModel, but model.bestModel cannot. I will add more when I understand the reason for the difference.
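
In other words, when the estimator passed to CrossValidator is a Pipeline, bestModel is a PipelineModel wrapping the fitted stages, so the logistic model has to be pulled out of the pipeline first, which is what Answer 1 does. A minimal sketch, assuming "model" is the CrossValidatorModel fitted in the question:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
// bestModel is a PipelineModel because the CrossValidator's estimator was a Pipeline
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
// the logistic regression was the only stage, so its fitted model is at index 0
val bestLr = bestPipeline.stages(0).asInstanceOf[LogisticRegressionModel]
println(s"coefficients: ${bestLr.coefficients}, intercept: ${bestLr.intercept}")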



Source: https://stackoverflow.com/questions/35911831/how-to-extract-variable-weight-from-spark-pipeline-logistic-model
