Question
Is it possible to get the OOB (out-of-bag) samples that the random forest algorithm uses for each tree? I'm using R. I know that random forest uses roughly 66% of the data (selected at random) to grow each tree and the remaining 34% or so as OOB samples to measure the OOB error, but I don't know how to get those OOB samples for each tree.
Any ideas?
Answer 1:
Assuming you are using the randomForest package, you just need to set the keep.inbag argument to TRUE.
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., iris, keep.inbag = TRUE)
The output list will contain an n by ntree matrix that can be accessed by the name inbag.
dim(rf$inbag)
# [1] 150 500
rf$inbag[1:5, 1:3]
#   [,1] [,2] [,3]
# 1    0    1    0
# 2    1    1    0
# 3    1    0    1
# 4    1    0    1
# 5    0    0    2
The values in the matrix tell you how many times a sample was in-bag. For example, the value of 2 in row 5 column 3 above says that the 5th observation was included in-bag twice for the 3rd tree.
As a bit of background here, a sample can show up in-bag more than once (hence the 2) because by default the sampling is done with replacement.
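Since the question asks for the OOB samples themselves, note that they are simply the rows whose in-bag count is 0 for a given tree. A minimal sketch of that extraction follows; the names oob_idx, oob_tree1, pred and oob_err_tree1 are made up for illustration, and the per-tree error at the end is only one way to use the indices, not something the package reports under that name.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., iris, keep.inbag = TRUE)

# For every tree, collect the row indices that were never drawn (count == 0).
# oob_idx is a list of length ntree; element t holds the OOB rows of tree t.
oob_idx <- lapply(seq_len(rf$ntree), function(t) which(rf$inbag[, t] == 0))

# The actual OOB observations for the first tree.
oob_tree1 <- iris[oob_idx[[1]], ]

# Optional: error of tree 1 on its own OOB rows.
# predict.all = TRUE also returns each tree's individual predictions.
pred <- predict(rf, iris, predict.all = TRUE)
oob_err_tree1 <- mean(
  pred$individual[oob_idx[[1]], 1] != as.character(iris$Species[oob_idx[[1]]])
)
```

Here oob_idx[[t]] gives the OOB row numbers for tree t, so iris[oob_idx[[t]], ] returns the corresponding observations.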
You can also sample without replacement via the replace parameter.
set.seed(1)
rf2 <- randomForest(Species ~ ., iris, keep.inbag = TRUE, replace = FALSE)
And now we can verify that, without replacement, no sample is included more than once in any tree.
# with replacement, the maximum number of times a sample is included in a tree is 7
max(rf$inbag)
# [1] 7
# without replacement, the maximum number of times a sample is included in a tree is 1
max(rf2$inbag)
# [1] 1
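To tie this back to the "66% / 34%" intuition in the question, the same matrices can be used to check how many draws each tree gets and what share of rows ends up OOB. This is a sketch that assumes the package's default sampsize (nrow(x) with replacement, ceiling(.632 * nrow(x)) without); the variable names are illustrative only.

```r
library(randomForest)

set.seed(1)
rf  <- randomForest(Species ~ ., iris, keep.inbag = TRUE)
rf2 <- randomForest(Species ~ ., iris, keep.inbag = TRUE, replace = FALSE)

# Total number of in-bag draws per tree (column sums of the count matrix).
in_bag_draws  <- colSums(rf$inbag)   # with replacement: n draws per tree
in_bag_draws2 <- colSums(rf2$inbag)  # without replacement: default sampsize

# Share of observations that are OOB in each tree.
oob_frac  <- colMeans(rf$inbag == 0)
oob_frac2 <- colMeans(rf2$inbag == 0)

range(in_bag_draws)
range(in_bag_draws2)
mean(oob_frac)   # close to (1 - 1/n)^n, i.e. a bit over a third
mean(oob_frac2)  # 1 - sampsize / n under the default sampsize
```

With replacement the OOB share varies from tree to tree around roughly 37%; without replacement it is the same fixed share for every tree, determined by sampsize.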
Source: https://stackoverflow.com/questions/47728851/how-can-i-get-the-oob-samples-used-for-each-tree-in-random-forest-model-r