I have a highly imbalanced data set with target class instances in the following ratio 60000:1000:1000:50
(i.e. a total of 4 classes). I want to use randomFor
classwt is correctly passed on to randomForest; check this example:
library(randomForest)
rf = randomForest(Species~., data = iris, classwt = c(1E-5,1E-5,1E5))
rf
#Call:
# randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05))
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
#
# OOB estimate of error rate: 66.67%
#Confusion matrix:
#           setosa versicolor virginica class.error
#setosa          0          0        50           1
#versicolor      0          0        50           1
#virginica       0          0        50           0
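With virginica weighted so heavily, every case is predicted as virginica, which shows that the weights take effect. Class weights are the priors on the outcomes, so you need to balance them to achieve the results you want.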
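As a rough starting point for balancing the priors, one common heuristic (an assumption here, not something the randomForest documentation prescribes) is to weight each class by the inverse of its frequency. The sketch below uses the 60000:1000:1000:50 ratio from the question; the class names A to D are hypothetical placeholders.

```r
# Class counts from the question; the names A-D are hypothetical placeholders
counts <- c(A = 60000, B = 1000, C = 1000, D = 50)

# Inverse-frequency weights, normalised to sum to 1
# (a heuristic starting point, not a guaranteed optimum)
w <- (1 / counts) / sum(1 / counts)

# The rarest class gets the largest weight, the most common the smallest
print(round(w, 5))
```

A vector like w could then be passed as classwt = w, in the same order as the factor levels of the response, and tuned from there by checking the OOB confusion matrix.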
On strata and sampsize, this answer might be of help: https://stackoverflow.com/a/20151341/2874779
In general, a sampsize with the same size for all classes seems reasonable. strata is a factor used for stratified resampling; in your case you don't need to supply it.
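A minimal sketch of equal per-class sampling, assuming a made-up imbalanced subset of iris (the object names imb, n_min and rf_bal are hypothetical). Each tree draws the same number of cases from every class, capped at the size of the rarest class. strata is passed explicitly only to make the stratification visible.

```r
library(randomForest)

set.seed(42)  # make the bootstrap samples reproducible

# Hypothetical imbalanced data: 50 setosa, 10 versicolor, 5 virginica
imb <- iris[c(1:50, 51:60, 101:105), ]
print(table(imb$Species))

# Per-class sample size: no larger than the rarest class
n_min <- min(table(imb$Species))

# Draw n_min cases from each class for every tree
rf_bal <- randomForest(
  x        = imb[, -5],
  y        = imb$Species,
  strata   = imb$Species,                          # stratify on the response
  sampsize = rep(n_min, nlevels(imb$Species))      # equal draw per class
)

print(rf_bal$confusion)
```

With a real data set in the 60000:1000:1000:50 range, the same pattern would cap every class at 50 cases per tree, so it may be worth increasing ntree to make sure the majority classes are still well covered across the forest.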