Question
I have a question related to the post "Predicted probabilities in R ranger package".
Imagine I have a mixed data frame, df (containing factor and numeric variables), and I want to do classification using ranger. I split this data frame into training and test sets, Train_Set and Test_Set. BiClass is my factor response variable and takes the values 0 and 1 (2 levels).
I want to calculate class probabilities with ranger and attach them to the data frame, using the following commands:
Biclass.ranger <- ranger(BiClass ~ ., data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 200, type='response', verbose = TRUE)$predictions)
The data frame probabilities has 2 columns (0 and 1), and its number of rows equals the number of rows in Test_Set.
Does this mean that if I append probabilities to Test_Set as the last two columns, they show the probability of each row being 0 or 1? Is my understanding correct?
My second question: when I attempt to calculate the confusion matrix with
pred = predict(Biclass.ranger, data=Test_Set, num.trees = 500, type='response', verbose = TRUE)
table(Test_Set$BiClass, pred$predictions)
I get the following error: Error in table(Test_Set$BiClass, pred$predictions) : all arguments must have the same length
What am I doing wrong?
Answer 1:
For your first question: yes, the two values in each row are the predicted probabilities of that row being class 0 and class 1. Using the example below:
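library(ranger)
# Random 100/50 train/test split of iris
idx = sample(nrow(iris),100)
data = iris
# Binary outcome: 1 for versicolor, 0 otherwise
data$Species = factor(ifelse(data$Species=="versicolor",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
# probability=TRUE grows a probability forest, so predictions are class probabilities
mdl <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE, probability=TRUE)
# One row per test observation, one column per class level
probabilities <- as.data.frame(predict(mdl, data = Test_Set, type='response', verbose = TRUE)$predictions)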
We can check whether the predicted probabilities agree with the actual labels:
par(mfrow=c(1,2))
boxplot(probabilities[,"0"] ~ Test_Set$Species,ylab="Prob 0",xlab="Actual label")
boxplot(probabilities[,"1"] ~ Test_Set$Species,ylab="Prob 1",xlab="Actual label")
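It is not the prettiest plot, but if the labels were flipped you would see it here: the probability of class 1 would be high for rows whose actual label is 0. For your second question: with probability=TRUE, pred$predictions is a matrix of probabilities with one column per class, not a vector of labels, so it has twice as many elements as Test_Set$BiClass and table() reports a length mismatch. To get one label per row, we find the column with the maximum probability and assign the corresponding label: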
max.col(probabilities) - 1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
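max.col goes through each row of probabilities and returns 1 or 2, depending on which column holds the larger probability; subtracting 1 turns that into the labels 0 and 1. This relies on the columns being named 0 and 1, in that order. Note that max.col breaks ties at random by default. For the confusion matrix: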
caret::confusionMatrix(table(max.col(probabilities) - 1,Test_Set$Species))
Confusion Matrix and Statistics

     0  1
  0 31  2
  1  0 17

Accuracy : 0.96
95% CI : (0.8629, 0.9951)
No Information Rate : 0.62
P-Value [Acc > NIR] : 2.048e-08
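The max.col step can be illustrated on a tiny hand-made matrix. This is only a sketch: probs is a hypothetical stand-in shaped like the ranger output (one column per class level), and the names winner and pred_class are made up for illustration. Looking the label up through the column names, rather than subtracting 1, avoids depending on the column order or on the labels being 0 and 1.

```r
# Hypothetical probability matrix shaped like ranger's output:
# one row per observation, one column per class level.
probs <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.6, 0.4),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("0", "1")))

# Column index of the largest probability in each row.
# ties.method = "first" makes ties deterministic (the default is "random").
winner <- max.col(probs, ties.method = "first")

# Map the index back to the class label via the column names,
# keeping the result as a factor with the same levels as the columns.
pred_class <- factor(colnames(probs)[winner], levels = colnames(probs))
print(pred_class)
```

Because pred_class is a factor with one element per row, it can be passed straight to table() alongside the actual labels.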
In your case, you can just do:
caret::confusionMatrix(table(max.col(probabilities)-1,Test_Set$BiClass))
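Putting both questions together, here is a rough end-to-end sketch on the iris stand-in data. It assumes the ranger package is installed; the seed, the prob_ column prefix, and the names Test_With_Probs, Predicted and conf_tab are illustrative choices, not part of the original question. It shows why table() failed (the prediction matrix has twice as many values as there are test rows), attaches the probabilities to the test set, and builds the confusion table with base R.

```r
library(ranger)

set.seed(42)  # assumption: a fixed seed, only to make the sketch repeatable

# Same toy setup as the iris example: a two-level factor outcome
data <- iris
data$Species <- factor(ifelse(data$Species == "versicolor", 1, 0))
idx <- sample(nrow(data), 100)
Train_Set <- data[idx, ]
Test_Set  <- data[-idx, ]

mdl  <- ranger(Species ~ ., data = Train_Set, probability = TRUE)
pred <- predict(mdl, data = Test_Set)

# pred$predictions is a matrix with one column per class, so it holds
# twice as many values as there are test rows; table() rejects that
probs <- pred$predictions
dim(probs)
length(probs) == 2 * nrow(Test_Set)

# Attach the probabilities under syntactically valid column names
Test_With_Probs <- cbind(Test_Set,
                         setNames(as.data.frame(probs),
                                  paste0("prob_", colnames(probs))))

# Turn probabilities into labels by column name, then tabulate
Test_With_Probs$Predicted <- factor(
  colnames(probs)[max.col(probs, ties.method = "first")],
  levels = levels(Test_Set$Species)
)
conf_tab <- table(Predicted = Test_With_Probs$Predicted,
                  Actual = Test_With_Probs$Species)
print(conf_tab)
```

The resulting conf_tab can also be handed to caret::confusionMatrix() to get accuracy and the related statistics.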
Source: https://stackoverflow.com/questions/60948668/ranger-predicted-class-probability-of-each-row-in-a-data-frame