TL;DR:
Is there something I can flag in the original randomForest
call to avoid having to re-run the
`model$predicted` is NOT the same thing returned by `predict()`. If you want the probability of the TRUE or FALSE class, you must either run `predict()` with `type="prob"`, or pass `x`, `y`, `xtest` and `ytest`, as in `randomForest(x, y, xtest=x, ytest=y)`, where `x = out.data[, feature.cols]` and `y = out.data[, response.col]`.

`model$predicted` returns, for each record, the class with the larger value in `model$votes`. `votes`, as @joran pointed out, is the proportion of OOB (out-of-bag) 'votes' from the random forest: a tree's vote counts for a record only when that record was in the tree's OOB sample. `predict()`, on the other hand, is based on the votes of all the trees, and with `type="prob"` it returns the proportion of trees voting for each class.

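
To make the difference concrete, here is a minimal sketch that compares the OOB vote proportions with the all-tree proportions from `predict()`. The data are an assumption: a two-class subset of the built-in `iris` set stands in for the asker's `out.data`, and the names `oob_votes` and `all_votes` are only illustrative.

```r
library(randomForest)

set.seed(1)
# Illustrative two-class data from iris (an assumption, not the asker's out.data)
dat <- iris[iris$Species != "setosa", ]
x <- dat[, 1:4]
y <- factor(dat$Species == "virginica")  # levels: "FALSE", "TRUE"

model <- randomForest(x, y)

# Each row uses only the trees for which that record was out of bag
oob_votes <- model$votes
# Each row uses every tree in the forest
all_votes <- predict(model, x, type = "prob")

# Same shape, but the values generally differ
dim(oob_votes)
dim(all_votes)
head(cbind(oob_TRUE = oob_votes[, "TRUE"], all_TRUE = all_votes[, "TRUE"]))

# With the default type, predict() returns classes rather than probabilities
head(predict(model, x))
```

Because `all_votes` scores the training rows with trees that were grown on those same rows, it will usually look more confident than `oob_votes`.
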
Using `randomForest(x, y, xtest=x, ytest=y)` works a little differently from passing a formula or simply calling `randomForest(x, y)`, as in the example given above. `randomForest(x, y, xtest=x, ytest=y)` WILL return the probability for each class. This may sound a little odd, but it is found under `model$test$votes`, with the predicted class under `model$test$predicted`, which simply selects the class with the larger value in `model$test$votes`. Also, when using `randomForest(x, y, xtest=x, ytest=y)`, `model$predicted` and `model$votes` have the same definitions as above.

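
The `test` component can be inspected as in the sketch below. It again assumes a two-class subset of `iris` in place of the asker's data, and the helper names `votes`, `top`, `no_tie` and `agree` are made up for illustration.

```r
library(randomForest)

set.seed(1)
# Illustrative two-class data from iris (an assumption, not the asker's out.data)
dat <- iris[iris$Species != "setosa", ]
x <- dat[, 1:4]
y <- factor(dat$Species == "virginica")  # levels: "FALSE", "TRUE"

model <- randomForest(x, y, xtest = x, ytest = y)

names(model$test)           # includes "predicted" and "votes"
head(model$test$votes)      # all-tree vote proportions for the rows of xtest
head(model$test$predicted)  # class with the larger vote share

# Recompute the class from the vote matrix. Exact ties, if any, are left out
# of the comparison because randomForest breaks them at random.
votes <- model$test$votes
top <- colnames(votes)[max.col(votes, ties.method = "first")]
no_tie <- votes[, 1] != votes[, 2]
agree <- all(top[no_tie] == as.character(model$test$predicted)[no_tie])
agree

# The OOB outputs are still there and keep their earlier meaning
head(model$votes)
head(model$predicted)
```
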
Finally, note that if `randomForest(x, y, xtest=x, ytest=y)` is used, the `keep.forest` flag must be set to TRUE in order to call `predict()` afterwards:

    model <- randomForest(x, y, xtest=x, ytest=y, keep.forest=TRUE)
    prob <- predict(model, x, type="prob")

`prob` WILL be equivalent to `model$test$votes`, since the test data input is `x` in both cases.
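
One way to check this equivalence yourself is sketched below, once more on a two-class subset of `iris` as stand-in data (an assumption). The names `max_diff` and `no_forest` are illustrative only.

```r
library(randomForest)

set.seed(1)
# Illustrative two-class data from iris (an assumption, not the asker's out.data)
dat <- iris[iris$Species != "setosa", ]
x <- dat[, 1:4]
y <- factor(dat$Species == "virginica")  # levels: "FALSE", "TRUE"

model <- randomForest(x, y, xtest = x, ytest = y, keep.forest = TRUE)
prob <- predict(model, x, type = "prob")

# Largest absolute difference between the two vote matrices
max_diff <- max(abs(as.numeric(prob) - as.numeric(model$test$votes)))
max_diff

# When xtest is supplied and keep.forest is left at its default,
# the trees are not stored, so predict() has nothing to work with
no_forest <- randomForest(x, y, xtest = x, ytest = y)
is.null(no_forest$forest)
```

If `max_diff` is effectively zero, the `predict()` call is redundant for data already passed as `xtest`, which is the point of the TL;DR.
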