R: Cross validation on a dataset with factors

忘掉有多难 2020-12-28 20:46

Often, I want to run a cross-validation on a dataset which contains some factor variables. After running for a while, the cross-validation routine fails with the error: factor x has new levels C.

3 Answers
  • 2020-12-28 21:28

    When I call traceback I get this:

    > traceback()
    9: stop(sprintf(ngettext(length(m), "factor %s has new level %s", 
           "factor %s has new levels %s"), nm, paste(nxl[m], collapse = ", ")), 
           domain = NA)
    8: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
    7: model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
    6: predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == 
           "link", "response", type), terms = terms, na.action = na.action)
    5: predict.glm(d.glm, data[j.out, , drop = FALSE], type = "response")
    4: predict(d.glm, data[j.out, , drop = FALSE], type = "response")
    3: mean((y - yhat)^2)
    2: cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE], 
           type = "response"))
    1: cv.glm(d, m, K = 2)
    

    And looking at the cv.glm function gives:

    > cv.glm
    function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2), 
        K = n) 
    {
        call <- match.call()
        if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
            runif(1)
        seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
        n <- nrow(data)
        out <- NULL
        if ((K > n) || (K <= 1)) 
            stop("'K' outside allowable range")
        K.o <- K
        K <- round(K)
        kvals <- unique(round(n/(1L:floor(n/2))))
        temp <- abs(kvals - K)
        if (!any(temp == 0)) 
            K <- kvals[temp == min(temp)][1L]
        if (K != K.o) 
            warning(gettextf("'K' has been set to %f", K), domain = NA)
        f <- ceiling(n/K)
        s <- sample0(rep(1L:K, f), n)
        n.s <- table(s)
        glm.y <- glmfit$y
        cost.0 <- cost(glm.y, fitted(glmfit))
        ms <- max(s)
        CV <- 0
        Call <- glmfit$call
        for (i in seq_len(ms)) {
            j.out <- seq_len(n)[(s == i)]
            j.in <- seq_len(n)[(s != i)]
            Call$data <- data[j.in, , drop = FALSE]
            d.glm <- eval.parent(Call)
            p.alpha <- n.s[i]/n
            cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
                , drop = FALSE], type = "response"))
            CV <- CV + p.alpha * cost.i
            cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
                data, type = "response"))
        }
        list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)), 
            seed = seed)
    }
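
    The loop body above is the key mechanism: cv.glm rewrites the data argument of the stored glm call with the training rows and re-evaluates it, so each fold refits the model from scratch. Stripped down (train_rows here is illustrative):

    Call <- m$call                               # the call that produced the original fit
    Call$data <- d[train_rows, , drop = FALSE]   # swap in only the training rows
    d.glm <- eval(Call)                          # refit the glm on this fold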
    

    It seems the problem has to do with your extremely small sample size and the categorical effect (with values "A", "B", and "C"). You are fitting a glm with two contrasts: "B vs. A" and "C vs. A". In each CV iteration, cv.glm refits the model on the training folds only. Given the sample size, one or more iterations are essentially guaranteed to have every "C" row land in the held-out fold, so the refit model never sees level "C"; the error then comes from asking that model for fitted probabilities on validation data containing a level of x that was not observed in the training data.
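
    A minimal sketch of the failure mode outside cv.glm entirely (the toy data here are illustrative, not the question's):

    toy <- data.frame(x = c("A", "A", "B", "B", "C"),
                      y = c(1, 2, 3, 4, 5),
                      stringsAsFactors = FALSE)          # x stays character until glm factors it
    fit <- glm(y ~ x, data = toy[toy$x != "C", ])        # training rows never contain "C"
    predict(fit, newdata = toy[toy$x == "C", , drop = FALSE], type = "response")
    ## Error in model.frame.default(...) : factor x has new levels C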

    Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modeling Strategies that one ought to avoid split-sample validation when the sample size is small and/or some cell counts are small in categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.

    Given the small sample size here, you should consider alternatives to split-sample cross-validation, such as a permutation test or a parametric bootstrap. Another important consideration is exactly why you feel model-based inference isn't appropriate. As Tukey said of the bootstrap, he'd like to call it a shotgun: it will blow the head off of any problem, as long as you're willing to reassemble the pieces.
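
    For illustration, a minimal parametric-bootstrap sketch (assuming d and m are the data and glm fit from the question; 200 replicates is an arbitrary choice):

    set.seed(1)
    boot_coefs <- replicate(200, {
        d_sim <- d
        d_sim$y <- simulate(m)[[1]]           # draw a response from the fitted model
        coef(glm(formula(m), data = d_sim))   # refit on the simulated data
    })
    apply(boot_coefs, 1, sd)                  # parametric-bootstrap standard errors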

  • 2020-12-28 21:33

    Everyone agrees that there surely is an optimal solution. But personally, I would just try the cv.glm call until it works, using while.

    m.cv <- try(cv.glm(d, m, K = 2))      # first try
    class(m.cv)                           # sometimes "try-error", sometimes "list"
    while (inherits(m.cv, "try-error")) {
        m.cv <- try(cv.glm(d, m, K = 2))  # retry with a fresh random fold split
    }
    class(m.cv)                           # always "list"
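
    One caveat: if no admissible split exists at all (say, a level that appears only once), the loop above never terminates. A retry cap (an illustrative addition, not part of the original code) guards against that:

    m.cv <- try(cv.glm(d, m, K = 2))
    attempts <- 1
    while (inherits(m.cv, "try-error") && attempts < 100) {  # give up after 100 tries
        m.cv <- try(cv.glm(d, m, K = 2))
        attempts <- attempts + 1
    }
    if (inherits(m.cv, "try-error")) stop("no admissible fold split found")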
    

    I've tried it with 100,000 rows in the data.frame and it only takes a few seconds.

    library(boot)
    n <- 100000
    d <- data.frame(x = c(rep('A', n), rep('B', n), 'C', 'C'), y = 1:(n * 2 + 2))
    m <- glm(y ~ x, data = d)

    m.cv <- try(cv.glm(d, m, K = 2))
    class(m.cv)                           # sometimes "try-error", sometimes "list"
    while (inherits(m.cv, "try-error")) {
        m.cv <- try(cv.glm(d, m, K = 2))
    }
    class(m.cv)                           # always "list"
    
  • 2020-12-28 21:36

    There don't seem to be many simple solutions around the web, so here's one I worked out that should be easy to generalize to as many factors as you need. It uses the caret package, but you could get away with just base R if you really wanted.

    To use cross-validation when you have multiple factors, follow a two-step process: convert the factors to numerics and multiply them together, then use this new variable as the target variable in a stratified sampling function. Be sure to remove it from (or keep it out of) your training set after creating your folds.

    If y is your dependent variable (DV) and x is a factor, then:

    #Simulated factors (which are conveniently distributed for the example)
    dataset <- data.frame(x = as.factor(rep(c(1, 10), 1000)),
                          y = as.factor(rep(c(1, 2, 3, 4), 250)[sample(1000)]))

    #Convert the factors to numerics and multiply together in a new variable
    dataset$cv.variable <- as.numeric(levels(dataset$x))[dataset$x] *
        as.numeric(levels(dataset$y))[dataset$y]

    prop.table(table(dataset$y))   #One way to view distribution of levels
    ftable(dataset$x, dataset$y)   #A full table of all x and y combinations

    folds <- caret::createFolds(dataset$cv.variable, k = 10)
    testIndexes <- folds[[1]]      #First fold; in practice, loop over all 10
    testData  <- as.data.frame(dataset[testIndexes, ])
    trainData <- as.data.frame(dataset[-testIndexes, ])

    prop.table(table(testData$y))
    ftable(testData$x, testData$y) #Evaluate distribution within the fold
    

    which should produce a result that is close to balanced.

    Note: In real life, if your sample lacks the requisite unique combinations of factors, then your problem is harder to overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling; a sketch of the first option follows.
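
    For illustration only, here is one way the level-dropping might look (min_per_fold is an arbitrary threshold, and interaction() is just one way to form the combined variable; neither is taken from the answer above):

    min_per_fold <- 10
    combo <- interaction(dataset$x, dataset$y, drop = TRUE)   #One level per x/y pair
    keep  <- combo %in% names(which(table(combo) >= min_per_fold))
    dataset_kept <- droplevels(dataset[keep, ])               #Rare combinations removed
    folds <- caret::createFolds(interaction(dataset_kept$x, dataset_kept$y, drop = TRUE),
                                k = 10)                       #Stratify on the kept pairs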
