Often I want to run cross-validation on a dataset that contains some factor variables, and after running for a while the cross-validation routine fails with the error: <
There don't seem to be many simple solutions around the web, so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and caret, but you could get away with just base R if you really wanted.
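To use cross-validation when you have multiple factors, follow a two-step process. First, convert the factors to numerics and multiply them together. Second, use this new variable as the target variable in a stratified sampling function. This works when every combination of levels yields a distinct product, as in the example here, where x is 1 or 10 and y is 1 to 4. Be sure to remove the new variable from your training set, or keep it out, after creating your folds.
If y is your dependent variable (DV) and x is a factor, then: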
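One caveat is worth checking before you rely on the product: two different combinations can multiply to the same number. The sketch below uses two hypothetical factors, `f1` and `f2` (not part of the example in this post), to show the collision. It also shows base R's `interaction()` as one possible alternative key that labels every combination separately. This is a swapped-in technique, not the multiplication approach described above.

```r
# Hypothetical factors whose numeric level values give colliding products
f1 <- factor(c(2, 3, 2, 3))
f2 <- factor(c(2, 2, 3, 3))

# Product of the numeric level values: (3, 2) and (2, 3) both give 6
prod_key <- as.numeric(as.character(f1)) * as.numeric(as.character(f2))
length(unique(prod_key))   # 3 distinct values for 4 combinations

# interaction() builds one factor level per observed combination;
# further factors can be passed as additional arguments
combo_key <- interaction(f1, f2, drop = TRUE)
nlevels(combo_key)         # 4 distinct levels
```

If your level values never collide when multiplied, the product is fine. Otherwise a factor key like `combo_key` may be the safer choice.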
# Simulated factors (conveniently distributed for the example), 1000 rows each
dataset <- data.frame(x = as.factor(rep(c(1, 10), 500)), y = as.factor(rep(c(1, 2, 3, 4), 250)[sample(1000)]))
# Convert the factors to their numeric level values and multiply them together in a new variable
dataset$cv.variable <- as.numeric(levels(dataset$x))[dataset$x] * as.numeric(levels(dataset$y))[dataset$y]
prop.table(table(dataset$y)) #One way to view distribution of levels
ftable(dataset$x,dataset$y) #A full table of all x and y combinations
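folds <- caret::createFolds(dataset$cv.variable, k = 10) # Stratified folds on the combined variable
k <- 1 # Index of the fold to hold out (any value from 1 to 10)
testIndexes <- folds[[k]]
testData <- as.data.frame(dataset[testIndexes, ])
trainData <- as.data.frame(dataset[-testIndexes, ])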
prop.table(table(testData$y))
ftable(testData$x,testData$y) #evaluate distribution
which should produce a result that is close to balanced.
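The check above looks at a single fold. Here is a minimal sketch that repeats it for every fold and drops the helper column before training. It assumes caret is installed and rebuilds the simulated data so it runs on its own. It also passes a factor key built with `interaction()` rather than the numeric product, because caret's documentation says numeric inputs to `createFolds` are grouped by percentiles rather than by each distinct value. That swap is my assumption about what stratifies most exactly, not part of the recipe above.

```r
library(caret)  # assumed to be installed

set.seed(1)  # arbitrary seed, only for reproducibility
dataset <- data.frame(
  x = factor(rep(c(1, 10), 500)),
  y = factor(sample(rep(1:4, 250)))
)

# Factor key: one level per observed (x, y) combination
dataset$cv.variable <- interaction(dataset$x, dataset$y, drop = TRUE)
folds <- createFolds(dataset$cv.variable, k = 10)

# For every fold, check that each combination appears in both splits
all_present <- vapply(folds, function(testIndexes) {
  testKey  <- dataset$cv.variable[testIndexes]
  trainKey <- dataset$cv.variable[-testIndexes]
  all(table(testKey) > 0) && all(table(trainKey) > 0)
}, logical(1))
all_present

# Keep the helper column out of the data used for fitting
trainData <- dataset[-folds[[1]], c("x", "y")]
testData  <- dataset[folds[[1]], c("x", "y")]
```

With a simulation this balanced, every entry of `all_present` should come back `TRUE`. On real data a `FALSE` tells you which fold is missing a combination.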
Note: In real life, if your sample lacks the requisite unique combinations of factors, then your problem is harder to overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.
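As a rough sketch of the first option, the base R code below counts how often each combination occurs and drops the rows whose combination is too rare to land in every fold. The toy data and the threshold (fewer rows than folds) are assumptions for illustration. Pick whatever cutoff suits your problem.

```r
# Hypothetical data in which the (x = "b", y = "3") combination is deliberately rare
dataset <- data.frame(
  x = factor(c(rep("a", 60), rep("b", 40))),
  y = factor(c(rep(c("1", "2", "3"), 20), rep(c("1", "2"), 19), "3", "3"))
)

k <- 10  # number of folds you plan to create

# Count the rows for each observed combination of x and y
combo  <- interaction(dataset$x, dataset$y, drop = TRUE)
counts <- table(combo)

# Combinations with fewer than k rows cannot appear in every fold
rare <- names(counts)[counts < k]
rare

# Drop those rows, then drop any factor levels left unused
kept <- droplevels(dataset[!(combo %in% rare), ])
nrow(kept)
```

Dropping rows changes the population your model describes, so it is worth reporting which combinations were excluded.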