R: Cross validation on a dataset with factors


忘掉有多难 2020-12-28 20:46

Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: <

3 answers

礼貌的吻别 (original poster)

2020-12-28 21:36
There don't seem to be many simple solutions around the web, so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and caret, but you could get away with just base R if you really wanted.

To use cross-validation when you have multiple factors, follow a two-step process. First, convert the factors to numerics and multiply them together. Second, use this new variable as the target variable in a stratified sampling function. Be sure to remove it from, or keep it out of, your training set after creating your folds.
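One caveat with the multiplication step is that two different combinations can give the same product (for example 1 × 4 and 2 × 2). The sketch below, on made-up data, checks for that and shows base R's `interaction()` as an alternative key that keeps every combination distinct. The names `d`, `to_num`, `product_key` and `combo_key` are illustrative only.

```r
# Hypothetical data: factor labels chosen so that several products collide
d <- data.frame(
  a = factor(c(1, 2, 4, 2)),
  b = factor(c(4, 2, 1, 3))
)

# Convert factor labels (not the internal integer codes) to numbers
to_num <- function(f) as.numeric(levels(f))[f]
d$product_key <- to_num(d$a) * to_num(d$b)

# interaction() builds one factor level per observed combination instead
d$combo_key <- interaction(d$a, d$b, drop = TRUE)

# The product is ambiguous when it has fewer distinct values than combinations
n_products <- length(unique(d$product_key))
n_combos   <- nlevels(d$combo_key)
product_is_unique <- n_products == n_combos
```

If `product_is_unique` comes out `FALSE`, stratifying on the combined factor rather than on the product is probably the safer choice.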

If y is your dependent variable (DV) and x is a factor, then:
```
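# Simulated factors (which are conveniently distributed for the example)
set.seed(1)  # make the shuffle reproducible
# 1000 rows: x alternates between 1 and 10, y is a shuffled, balanced 1..4
dataset <- data.frame(x = as.factor(rep(c(1, 10), 500)),
                      y = as.factor(rep(c(1, 2, 3, 4), 250)[sample(1000)]))

# Convert the factors to numerics and multiply them together in a new variable
dataset$cv.variable <- as.numeric(levels(dataset$x))[dataset$x] *
  as.numeric(levels(dataset$y))[dataset$y]

prop.table(table(dataset$y))  # One way to view the distribution of levels
ftable(dataset$x, dataset$y)  # A full table of all x and y combinations

# as.factor() makes caret stratify on each unique value; a numeric target
# would be binned by percentiles instead
folds <- caret::createFolds(as.factor(dataset$cv.variable), k = 10)
k <- 1  # index of the fold to hold out (loop over 1:10 for a full CV)
testIndexes <- folds[[k]]
testData <- as.data.frame(dataset[testIndexes, ])
trainData <- as.data.frame(dataset[-testIndexes, ])

# Keep the helper variable out of the modelling data
testData$cv.variable <- NULL
trainData$cv.variable <- NULL

prop.table(table(testData$y))
ftable(testData$x, testData$y)  # Evaluate the distribution in the held-out fold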
```
which should produce a result that is close to balanced.
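For readers who want to avoid caret, here is a minimal base R sketch of the same idea, looping over all folds. It assumes simulated data in the same spirit as the example above and stratifies on `interaction(x, y)` rather than on the product. `assign_folds` is a hypothetical helper written for this illustration, not a function from any package.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated data: 1000 rows, two factors
dataset <- data.frame(
  x = factor(rep(c(1, 10), 500)),
  y = factor(sample(rep(1:4, 250)))
)

# One stratum per observed x/y combination
strata <- interaction(dataset$x, dataset$y, drop = TRUE)

# Hypothetical helper: within each stratum, shuffle the rows and deal them
# out to folds 1..k in turn, so each fold gets a near-equal share of it
assign_folds <- function(strata, k) {
  fold <- integer(length(strata))
  for (rows in split(seq_along(strata), strata)) {
    shuffled <- rows[sample.int(length(rows))]
    fold[shuffled] <- rep_len(seq_len(k), length(shuffled))
  }
  fold
}

k <- 10
dataset$fold <- assign_folds(strata, k)

# Loop over the folds; the model fitting step is left out on purpose
for (i in seq_len(k)) {
  testData  <- dataset[dataset$fold == i, c("x", "y")]
  trainData <- dataset[dataset$fold != i, c("x", "y")]
  # Every x/y combination should be present in the training data
  stopifnot(all(table(trainData$x, trainData$y) > 0))
}
```

Dealing rows out in turn keeps the per-stratum counts within one row of each other across folds, at the cost of the lower-numbered folds being slightly larger overall.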

Note: In real life, if your sample lacks the requisite unique combinations of factors, then your problem is harder to overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.
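A quick way to find out whether you are in that situation is to count the rows per combination before creating folds. The sketch below uses made-up data and a hypothetical choice of `k = 5`. It flags combinations with fewer than `k` rows and shows only the first option, dropping them; over-sampling is not covered.

```r
# Hypothetical data in which the combination x = "b", y = "2" occurs only once
dataset <- data.frame(
  x = factor(c(rep("a", 40), rep("b", 21))),
  y = factor(c(rep(c("1", "2"), 20), rep("1", 20), "2"))
)
k <- 5  # number of folds planned (an assumption for this example)

# Count rows per x/y combination
combo <- interaction(dataset$x, dataset$y, drop = TRUE)
combo_counts <- table(combo)

# Combinations with fewer than k rows cannot appear in every fold
rare <- names(combo_counts)[combo_counts < k]

# Drop the rows belonging to rare combinations, then tidy unused levels
kept <- droplevels(dataset[!(combo %in% rare), ])
```

Whether dropping rows is acceptable depends on the analysis, so treat `rare` first as a diagnostic and only then decide what to do with it.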