How to get the probability per instance in classification models in spark.mllib

Question

I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest to build classification models. However, these models only predict a single class per instance. In Weka, we can get the exact probability of each instance belonging to each class. How can we do the same with these packages?

In LogisticRegressionModel we can set the threshold, so I've written a function that checks the results for each point at different thresholds. But this cannot be done for RandomForest (see How to set cutoff while training the data in Random Forest in Spark).

Answer 1:

Unfortunately, with MLlib you can't get the per-instance probabilities for classification models up to and including version 1.4.1.
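For binary logistic regression specifically, one partial workaround is to compute the score by hand from the fitted coefficients. The sketch below is plain Python with no Spark dependency; it assumes you have copied the values of the model's `weights` and `intercept` into ordinary lists and floats, and the helper name `logistic_probability` is hypothetical. It does not help with RandomForest.

```python
import math

def logistic_probability(weights, intercept, features):
    """Return P(class = 1) for one instance of a binary logistic model.

    Assumption: `weights` and `intercept` hold the coefficients of an
    already-trained binary model, copied out as plain Python values.
    """
    # Linear margin w . x + b
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # Numerically stable sigmoid: never exponentiate a large positive number
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)

# Illustrative (made-up) coefficients and a single feature vector
weights = [0.5, -0.25]
intercept = 0.0
p = logistic_probability(weights, intercept, [2.0, 4.0])
```

With these made-up values the margin is zero, so `p` sits exactly on the decision boundary. Sweeping a threshold over such scores reproduces the per-threshold check described in the question without retraining.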

There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are marked IN PROGRESS as I write this answer. Nevertheless, the work seems to have been on hold since November 2014.

There is currently no way to get the posterior probability of a prediction with the Naive Bayes model during prediction. This should be made available along with the label.

And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:

This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
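A minimal sketch of that hack, in plain Python rather than against Spark itself: it assumes a multinomial Naive Bayes model whose internal `pi` holds the log class priors and whose `theta` holds the log conditional feature probabilities (one row per class), and that both have been copied out as ordinary lists. The function name `naive_bayes_posteriors` is hypothetical.

```python
import math

def naive_bayes_posteriors(pi, theta, features):
    """Posterior probability of each class for one instance.

    Assumptions: multinomial Naive Bayes; pi[c] is the log prior of
    class c and theta[c][j] is the log probability of feature j given
    class c.
    """
    # Unnormalised log posterior per class: log prior + log likelihood
    log_scores = [
        prior + sum(t * x for t, x in zip(row, features))
        for prior, row in zip(pi, theta)
    ]
    # Subtract the maximum before exponentiating to avoid underflow
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative (made-up) two-class model with two features
pi = [math.log(0.5), math.log(0.5)]
theta = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.2), math.log(0.8)],
]
posteriors = naive_bayes_posteriors(pi, theta, [1.0, 0.0])
```

The returned list is ordered like `pi`, so mapping an entry back to a class label requires the model's label array in the same order.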

Reference : source.

MAJOR EDIT: This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.

Source: https://stackoverflow.com/questions/31231514/how-to-get-the-probability-per-instance-in-classifications-models-in-spark-mllib

Tags

apache-spark

random-forest

logistic-regression

apache-spark-mllib