I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel.
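For the probability part of the question: if the DataFrame-based pyspark.ml API is an option (rather than mllib), its RandomForestClassifier attaches a per-class probability vector to every prediction. A minimal sketch, where the column names, numTrees, and the data DataFrame are assumptions about your setup rather than something from the question:

from pyspark.ml.classification import RandomForestClassifier

# assumes `data` is a DataFrame with a "features" vector column and a numeric "label" column
train, test = data.randomSplit([0.7, 0.3])

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = rf.fit(train)

# transform() appends rawPrediction, probability and prediction columns;
# "probability" holds the per-class probability vector for each row
model.transform(test).select("probability", "prediction").show(5, truncate=False)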
People have probably moved on from this post, but I hit the same problem today while trying to compute the accuracy of a multi-class classifier against a training set, so I thought I'd share my experience in case someone else is trying this with mllib.
The accuracy can be computed fairly easily as follows:
# say you have a test set against which you want to run your classifier
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# run the trained mllib model over the test feature vectors
# (this assumes testData has a 'features' column of mllib Vectors)
predictions = model.predict(testData.rdd.map(lambda row: row.features)).collect()

# convert the Spark DataFrame holding the test data to pandas
ptd = testData.toPandas()

# now count how many labels match the predictions; the labels were shifted
# from 1-10 down to 0-9 beforehand, since mllib expects class labels
# in the range 0 .. numClasses-1
correct = ((ptd.label - 1) == predictions).sum()

# accuracy as a percentage of the test set size
m = ptd.shape[0]
print(100.0 * correct / m)
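If you'd rather let Spark do the counting, mllib's MulticlassMetrics can produce the same number from an RDD of (prediction, label) pairs. A rough sketch built on the variables above, assuming sc is your SparkContext (the .accuracy property needs Spark 2.0+; older versions expose precision() instead):

from pyspark.mllib.evaluation import MulticlassMetrics

# pair each prediction with its (already 0-based) true label as plain floats
pairs = [(float(p), float(l)) for p, l in zip(predictions, ptd.label - 1)]
metrics = MulticlassMetrics(sc.parallelize(pairs))
print(metrics.accuracy * 100)  # should match the manual percentage above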