Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output

Submitted by 限于喜欢 on 2020-01-01 05:29:09

Question


As of now, GBTs in Spark only give you predicted labels, not class probabilities.

I was thinking of trying to calculate predicted probabilities for a class (say, for all the instances falling under a certain leaf).

The code to build the GBT:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)  

model.toDebugString

For simplicity I trained only 2 trees of depth 2, which look like this:

  Tree 0:
    If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
    Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
      Predict: 1.0
  Tree 1:
    If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
      Predict: -0.550117566730937
    Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
      Predict: 0.4332824083446489

My question is: can I use the trees above to calculate predicted probabilities as follows?

For every instance in the feature set used for prediction:

exp(leaf score from tree 0 + leaf score from tree 1)/(1+exp(leaf score from tree 0 + leaf score from tree 1))

This gives me a kind of probability, but I am not sure it is the right way to do it. Also, if there is any document explaining how the leaf scores (predictions) are calculated, I would be really grateful if someone could share it.

Any suggestions would be appreciated.


Answer 1:


Here is my approach using Spark's internal dependencies. You will need to import the linear algebra library for the matrix operation later, i.e., multiplying the tree predictions by the tree weights (the learning rate).

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}

Say you build a model with GBT:

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

To calculate the probability using the model object:

// Get the log odds predictions from each tree
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }

// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)

// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect

// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect

Here is a code snippet you can copy & paste directly into spark-shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect

// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect



Answer 2:


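// netlib-java BLAS, as used inside Spark's own tree ensemble code at the time (assumption:
// your Spark build ships it; otherwise a plain zip-and-sum dot product is equivalent)
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Raw score: dot product of the per-tree predictions and the tree weights
def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
    val treePredictions = gbdt.trees.map(_.predict(features))
    blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}
def sigmoid(v: Double): Double = {
    1 / (1 + Math.exp(-v))
}
// model is the output of GradientBoostedTrees.train(..., ...)
// testData is in libSVM format
val labelAndPreds = testData.map { point =>
    val prediction = sigmoid(score(point.features, model))
    // (true label, [P(class 0), P(class 1)])
    (point.label, Vectors.dense(1.0 - prediction, prediction))
}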



Answer 3:


Actually, I was able to predict the probabilities using the trees and the formula given in the question. I checked the result against the labels predicted by the GBT, and they match exactly when I use a threshold of 0.5.

So we do the same as in the question, with one slight change: the later tree is weighted by the learning rate.

For every instance in the feature set used for prediction:

exp(leaf score from tree 0 + (learning_rate)* leaf score from tree 1)/(1+exp(leaf score from tree 0 + (learning_rate)* leaf score from tree 1))

This essentially gives me the predicted probabilities.

I tested the same approach on 3 trees of depth 3, and on different data sets. It worked.

It would be great to know whether anyone else has already tried this. If not, please try it and comment.
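As a quick sanity check of the formula above, here is a minimal pure-Scala sketch (no Spark needed) that plugs in two leaf scores from the trees printed in the question. The choice of leaves (0.7272727272727273 from Tree 0 and 0.5739387416147804 from Tree 1), the object name and the helper names are picked purely for illustration, and the weights assume that the first tree has weight 1.0 and the second has the learning rate 0.1. The `scale` parameter is 1.0 for the formula in this answer; 2.0 corresponds to the exp(-2F) variant discussed in the other answers.

```scala
object LeafScoreExample {
  // Logistic function applied to the ensemble margin.
  // scale = 1.0 reproduces exp(F) / (1 + exp(F)); scale = 2.0 is the exp(-2F) variant.
  def prob(margin: Double, scale: Double = 1.0): Double =
    1.0 / (1.0 + math.exp(-scale * margin))

  // Weighted sum of leaf scores: one score and one weight per tree.
  def margin(leafScores: Seq[Double], weights: Seq[Double]): Double =
    leafScores.zip(weights).map { case (s, w) => s * w }.sum

  def main(args: Array[String]): Unit = {
    // Illustrative leaves taken from the debug string in the question.
    val leafScores = Seq(0.7272727272727273, 0.5739387416147804)
    // Assumed weights: 1.0 for the first tree, learning rate 0.1 for the second.
    val weights = Seq(1.0, 0.1)
    val f = margin(leafScores, weights)
    println(f"margin = $f%.6f")
    println(f"probability (scale 1) = ${prob(f)}%.6f")
    println(f"probability (scale 2) = ${prob(f, 2.0)}%.6f")
    println(s"label at threshold 0.5 = ${if (prob(f) > 0.5) 1.0 else 0.0}")
  }
}
```

Note that both scales cross 0.5 exactly at a margin of 0, so agreement with the predicted labels at a threshold of 0.5 confirms the sign of the margin but cannot by itself tell the two probability formulas apart.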




Answer 4:


In fact, the answer above is wrong: the plain sigmoid function is not correct in this situation, because Spark maps the labels to {-1, 1}. You should use code like this:

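// netlib-java BLAS, as used inside Spark's own tree ensemble code at the time (assumption)
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
    val treePredictions = gbdt.trees.map(_.predict(features))
    blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}
val labelAndPreds = testData.map { point =>
    val margin = score(point.features, model)
    // Note the factor of 2 in the exponent
    val prediction = 1.0 / (1.0 + math.exp(-2.0 * margin))
    (point.label, Vectors.dense(1.0 - prediction, prediction))
}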

More detail can be found on page 9 of "Greedy Function Approximation: A Gradient Boosting Machine", and in this Spark pull request: https://github.com/apache/spark/pull/16441
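The factor of 2 can be read off the loss itself. Below is a short derivation, under the assumption that the model minimizes the log loss as documented in Spark's LogLoss source, 2 log(1 + exp(-2 y F(x))) with y in {-1, 1}. Here p is shorthand for Pr(y = 1 | x), a symbol introduced only for this sketch.

```latex
\begin{aligned}
\mathbb{E}[L \mid x]
  &= 2p\,\log\!\left(1 + e^{-2F}\right) + 2(1-p)\,\log\!\left(1 + e^{2F}\right) \\[4pt]
\frac{\partial\,\mathbb{E}[L \mid x]}{\partial F}
  &= -\frac{4p}{1 + e^{2F}} + \frac{4(1-p)\,e^{2F}}{1 + e^{2F}} = 0 \\[4pt]
\Longrightarrow\quad e^{2F} &= \frac{p}{1-p},
  \qquad F = \tfrac{1}{2}\log\frac{p}{1-p} \\[4pt]
\Longrightarrow\quad p &= \frac{1}{1 + e^{-2F}}
\end{aligned}
```

The leading constant 2 in the loss multiplies both gradient terms equally, so it does not move the minimizer; the 2 inside the exponent is what ends up in the probability formula.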




Answer 5:


In fact, what @hbghhy says is wrong and @Run2 is right. Spark uses twice the binomial negative log-likelihood as its loss, whereas Friedman uses the binomial negative log-likelihood itself (page 9 of "Greedy Function Approximation"). The relevant Spark source:

/**
 * :: DeveloperApi ::
 * Class for log loss calculation (for classification).
 * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
 *
 * The log loss is defined as:
 *   2 log(1 + exp(-2 y F(x)))
 * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
 */
@Since("1.2.0")
@DeveloperApi
object LogLoss extends ClassificationLoss {

  /**
   * Method to calculate the loss gradients for the gradient boosting calculation for binary
   * classification
   * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
   * @param prediction Predicted label.
   * @param label True label.
   * @return Loss gradient
   */
  @Since("1.2.0")
  override def gradient(prediction: Double, label: Double): Double = {
    - 4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
  }

  override private[spark] def computeError(prediction: Double, label: Double): Double = {
    val margin = 2.0 * label * prediction
    // The following is equivalent to 2.0 * log(1 + exp(-margin)) but more numerically stable.
    2.0 * MLUtils.log1pExp(-margin)
  }
}
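To make the two formulas in the doc comments concrete, here is a small pure-Scala sketch (no Spark dependency) that re-implements the loss and its gradient and compares the analytic gradient with a central finite difference. The object name, the local `log1pExp` helper (a stand-in for `MLUtils.log1pExp`, which as far as I know is not part of Spark's public API) and the sample points are my own choices for illustration.

```scala
object LogLossCheck {
  // Numerically stable log(1 + exp(x)): avoids overflow of exp(x) for large positive x.
  def log1pExp(x: Double): Double =
    if (x > 0) x + java.lang.Math.log1p(math.exp(-x))
    else java.lang.Math.log1p(math.exp(x))

  // Loss from the doc comment: 2 log(1 + exp(-2 y F)), with y in {-1, 1}.
  def loss(prediction: Double, label: Double): Double =
    2.0 * log1pExp(-2.0 * label * prediction)

  // Gradient from the doc comment: -4 y / (1 + exp(2 y F)).
  def gradient(prediction: Double, label: Double): Double =
    -4.0 * label / (1.0 + math.exp(2.0 * label * prediction))

  // Central finite-difference approximation of d(loss)/dF.
  def numericGradient(prediction: Double, label: Double, h: Double = 1e-6): Double =
    (loss(prediction + h, label) - loss(prediction - h, label)) / (2.0 * h)

  def main(args: Array[String]): Unit = {
    for (label <- Seq(-1.0, 1.0); f <- Seq(-2.0, 0.0, 0.5)) {
      val analytic = gradient(f, label)
      val numeric = numericGradient(f, label)
      println(f"y=$label%+.0f F=$f%+.1f analytic=$analytic%.6f numeric=$numeric%.6f")
    }
  }
}
```

The two columns should agree to several decimal places at every sample point, which is a cheap way to confirm that the gradient in the quoted source matches the loss in its doc comment.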


Source: https://stackoverflow.com/questions/37303855/predicting-probabilities-of-classes-in-case-of-gradient-boosting-trees-in-spark
