The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R

Question

While using the predict function in R to get the predictions from a Random Forest model, I misspecified the training data as newdata as follows:

RF1pred <- predict(RF1, newdata=TrainS1, type = "class")

Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.

If someone could elaborate, I will be grateful.

Thank you!

EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset altogether, like so:

RF1pred <- predict(RF1, type = "class")

If a new dataset is not explicitly specified, isn't the training data used for prediction? If so, shouldn't I get the same results from both lines of code?

EDIT2: Here is a sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.

# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=TRUE), x2=rpois(100,10), y=sample(0:1, 100, replace=TRUE))

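# Build random forest (x1 is numeric here, so randomForest fits a regression forest)
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)                   # no newdata supplied
pred2 <- predict(model, newdata = train)  # training data passed as newdata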

# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC  # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC  # 0.7125

Answer 1:

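If you look at the documentation for predict.randomForest, you will see that if you do not supply a new data set, you get the out-of-bag (OOB) predictions of the model: each training row is predicted using only the trees whose bootstrap sample did not contain that row. Passing the training data as newdata instead runs every row through all trees, including the ones that were grown on that row, so the resulting accuracy and AUC are optimistically biased. Because the OOB performance approximates the performance of your model on a different data set, those results are much more realistic (although still not a substitute for a real, independently collected validation set).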
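The difference can be checked directly. The following is a minimal sketch, assuming the same kind of random data as in the question (the unused y column is dropped, so the numbers will not match the AUC values quoted there). The variable names oob_pred, resub_pred, same_as_stored, oob_mse and resub_mse are made up for this illustration. It compares predict() without newdata against the OOB predictions stored in the fitted object, then compares the error of both prediction sets.

```r
library(randomForest)

set.seed(15)
train <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                    x2 = rpois(100, 10))

# x1 is numeric, so this is a regression forest; randomForest warns about
# the small number of unique response values, which is silenced here.
model <- suppressWarnings(randomForest(x1 ~ x2, data = train))

# No newdata: each row is predicted only by trees that did not see it.
oob_pred <- predict(model)

# Training data as newdata: each row is predicted by all trees,
# including the ones grown on a bootstrap sample containing that row.
resub_pred <- predict(model, newdata = train)

# predict() without newdata should match the OOB predictions kept in the model.
same_as_stored <- isTRUE(all.equal(oob_pred, model$predicted))

# Mean squared error of each set of predictions against the true response.
oob_mse   <- mean((train$x1 - oob_pred)^2)
resub_mse <- mean((train$x1 - resub_pred)^2)

print(same_as_stored)
print(c(oob = oob_mse, resubstitution = resub_mse))
```

Since x1 and x2 are unrelated by construction, any apparent skill in the newdata=train predictions comes from the trees having memorised the rows they were trained on, which is why only the OOB figure is a reasonable stand-in for performance on unseen data.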

Source: https://stackoverflow.com/questions/31594951/the-effect-of-specifying-training-data-as-new-data-when-making-random-forest-pre

Tags

random-forest

predict