R: Cross validation on a dataset with factors

忘掉有多难 2020-12-28 20:46

Often I want to run cross-validation on a dataset that contains some factor variables, and after running for a while the cross-validation routine fails with the error: <

3 Answers
  •  礼貌的吻别
    2020-12-28 21:36

    There don't seem to be many simple solutions around the web, so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and caret, but you could get away with just base R if you really wanted.

    To use cross-validation when you have multiple factors, follow a two-step process. First, convert the factors to numerics and multiply them together; this works as a stratification key as long as every combination of levels yields a distinct product. Second, use this new variable as the target variable in a stratified sampling function. Be sure to remove it from, or keep it out of, your training set after creating your folds.

    If y is your DV and x is a factor then:

    set.seed(42)  # make the shuffle and the fold assignment reproducible

    # Simulated factors (conveniently distributed for the example), 1000 rows each
    dataset <- data.frame(x = as.factor(rep(c(1, 10), 500)),
                          y = as.factor(rep(c(1, 2, 3, 4), 250)[sample(1000)]))

    # Convert the factors to numerics and multiply them together in a new variable
    dataset$cv.variable <- as.numeric(levels(dataset$x))[dataset$x] *
      as.numeric(levels(dataset$y))[dataset$y]

    prop.table(table(dataset$y))    # One way to view the distribution of levels
    ftable(dataset$x, dataset$y)    # A full table of all x and y combinations

    # createFolds() bins numeric input by quantiles, so pass the key as a factor
    # to stratify on each distinct x-y combination
    folds <- caret::createFolds(factor(dataset$cv.variable), k = 10)
    dataset$cv.variable <- NULL     # keep the helper variable out of the model data

    i <- 1                          # index of the fold to hold out (1 to 10)
    testIndexes <- folds[[i]]
    testData <- dataset[testIndexes, ]
    trainData <- dataset[-testIndexes, ]

    prop.table(table(testData$y))
    ftable(testData$x, testData$y)  # Evaluate the distribution in the held-out fold
    

    which should produce a result that is close to balanced.
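
The same idea can be sketched in base R alone. This is an illustrative variant rather than the answer's own code: it builds the stratification key with `interaction()` instead of multiplying (products can collide, for example 2*3 and 3*2), and `make_stratified_folds` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: assign each row to one of k folds, stratified by `key`.
make_stratified_folds <- function(key, k = 10) {
  key <- droplevels(as.factor(key))
  fold_id <- integer(length(key))
  for (rows in split(seq_along(key), key)) {
    # Shuffle the rows of this stratum, then deal them out across the folds.
    # Indexing with sample.int() is safe even when a stratum has a single row.
    shuffled <- rows[sample.int(length(rows))]
    fold_id[shuffled] <- rep_len(sample.int(k), length(shuffled))
  }
  fold_id
}

set.seed(1)
dataset <- data.frame(
  x = factor(rep(c(1, 10), 500)),
  y = factor(sample(rep(1:4, 250)))
)

# One level per observed x-y combination, so distinct combinations never merge.
key <- interaction(dataset$x, dataset$y, drop = TRUE)
fold_id <- make_stratified_folds(key, k = 10)

# Rows: x-y combinations; columns: folds. Counts within a row should be near equal.
combo_by_fold <- table(key, fold_id)
print(combo_by_fold)

# Every training set (all folds but one) should contain every level of x and y.
all_levels_in_training <- all(sapply(1:10, function(i) {
  train <- dataset[fold_id != i, ]
  all(table(train$x) > 0) && all(table(train$y) > 0)
}))
```

Dealing each stratum's rows out in turn means the per-fold count of any combination differs by at most one, which is what keeps factor levels from going missing in a training set.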

    Note: In real life, if your sample lacks the requisite unique combinations of factors, then your problem is harder to overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.
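
One way to see whether a sample has enough of each combination is to count them before creating folds. The sketch below uses made-up data and an assumed rule of thumb (at least `k` rows per combination, so that every fold can receive one); it then drops the rows in sparser combinations, which is only one of the two options mentioned in the note.

```r
k <- 5  # number of folds planned (assumed for illustration)

# Made-up data in which the combination x = "b", y = "rare" occurs only twice.
dataset <- data.frame(
  x = factor(c(rep("a", 20), rep("b", 22))),
  y = factor(c(rep(c("u", "v"), 10), rep(c("u", "v"), 10), "rare", "rare"))
)

# Count the rows in every observed x-y combination.
combo <- interaction(dataset$x, dataset$y, drop = TRUE)
combo_counts <- table(combo)
print(combo_counts)

# Combinations with fewer than k rows cannot appear in every fold.
sparse <- names(combo_counts)[combo_counts < k]

# Drop those rows, then drop any factor levels that are now empty.
kept <- droplevels(dataset[!(combo %in% sparse), ])
```

Calling `droplevels()` matters here: without it the emptied level stays attached to the factor and can still trigger level-related errors downstream.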
