Cross validation for glm() models

南旧 2021-01-31 11:33

I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boo

2 answers
  • 2021-01-31 12:23

    I am always a little cautious about using the 10-fold cross validation methods built into various packages. I have my own simple script that creates the test and training partitions manually, so it works with any machine learning package:

    # Randomly shuffle the data
    yourData <- yourData[sample(nrow(yourData)), ]
    
    # Create 10 folds of (nearly) equal size
    folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
    
    # Perform 10-fold cross validation
    for (i in 1:10) {
        # Segment your data by fold using the which() function
        testIndexes <- which(folds == i, arr.ind = TRUE)
        testData <- yourData[testIndexes, ]
        trainData <- yourData[-testIndexes, ]
        # Use test and train data partitions however you desire...
    }
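
    As one way to fill in the last comment of the loop, here is a minimal sketch that fits a glm() on each training partition and records the mean squared error on the held-out fold. The built-in mtcars data, the formula mpg ~ wt + hp, and the names foldMSE and cvMSE are assumptions for illustration only, not part of the original script.

    ```r
    set.seed(42)                                  # make the shuffle reproducible
    yourData <- mtcars[sample(nrow(mtcars)), ]    # stand-in data set (assumption)

    # Same fold construction as in the script above
    folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

    foldMSE <- numeric(10)
    for (i in 1:10) {
        testIndexes <- which(folds == i)
        testData  <- yourData[testIndexes, ]
        trainData <- yourData[-testIndexes, ]

        # Fit on the training partition, predict on the held-out fold
        fit  <- glm(mpg ~ wt + hp, data = trainData, family = gaussian)
        yhat <- predict(fit, newdata = testData, type = "response")
        foldMSE[i] <- mean((testData$mpg - yhat)^2)
    }

    # Combine the per-fold errors, weighting each fold by its size
    foldSize <- tabulate(folds, nbins = 10)
    cvMSE <- sum(foldMSE * foldSize) / nrow(yourData)
    ```

    Weighting by fold size matters here because 32 rows do not split evenly into 10 folds, so a plain mean of foldMSE would give the smaller folds slightly too much influence.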
    
  • 2021-01-31 12:37

    @Roman provided some answers in his comments; however, the answers to your questions can be found by inspecting the code of cv.glm:

    I believe this bit of code splits the data set randomly into the K folds, rounding K and moving it to the nearest workable value (with a warning) if K does not divide n:

    if ((K > n) || (K <= 1)) 
        stop("'K' outside allowable range")
    K.o <- K
    K <- round(K)
    kvals <- unique(round(n/(1L:floor(n/2))))
    temp <- abs(kvals - K)
    if (!any(temp == 0)) 
        K <- kvals[temp == min(temp)][1L]
    if (K != K.o) 
        warning(gettextf("'K' has been set to %f", K), domain = NA)
    f <- ceiling(n/K)
    s <- sample0(rep(1L:K, f), n)
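
    To see what that adjustment does in practice, the logic can be copied into a small standalone function. This is only a sketch: adjust_k is a hypothetical name, the range check and the warning are left out, and sample() is used as a stand-in for the package-internal sample0().

    ```r
    # Hypothetical helper mirroring the K-adjustment logic quoted above
    adjust_k <- function(n, K) {
        K <- round(K)
        kvals <- unique(round(n / (1L:floor(n / 2))))   # candidate values for K
        temp <- abs(kvals - K)
        if (!any(temp == 0))
            K <- kvals[temp == min(temp)][1L]           # fall back to the nearest candidate
        K
    }

    adjust_k(100, 10)   # 10 is among the candidates, so K stays 10
    adjust_k(32, 10)    # 10 is not a candidate; the nearest one, 11, is used

    # Fold labels: repeat 1..K often enough to cover n, then draw n of them at random
    set.seed(1)
    n <- 32
    K <- adjust_k(n, 10)
    f <- ceiling(n / K)
    s <- sample(rep(1L:K, f), n)   # stand-in for sample0() (assumption)
    table(s)                       # number of observations in each fold
    ```

    So with 32 rows, asking for K = 10 silently becomes an 11-fold cross validation, which is what the warning in the quoted code reports.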
    

    The next bit shows that the delta value is NOT the root mean square error. As the help file says of the cost argument, "The default is the average squared error function." What does this mean? We can see it by inspecting the function declaration:

    function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2), 
        K = n) 
    

    which shows that within each fold we calculate the average of the squared errors, where the error is, in the usual sense, the difference between the actual response and the predicted response.
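
    A tiny worked example makes the distinction concrete. The cost function below is the default from the declaration above; the numbers in y and yhat are made up for illustration.

    ```r
    # Default cost from the declaration above: average squared error
    cost <- function(y, yhat) mean((y - yhat)^2)

    y    <- c(1, 2, 3)      # made-up observed responses
    yhat <- c(1, 2, 5)      # made-up predictions

    mse  <- cost(y, yhat)   # (0 + 0 + 4) / 3
    rmse <- sqrt(mse)       # only this would be a root mean square error
    ```

    If you want an RMSE, take the square root of the reported value yourself.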

    delta[1] is simply the weighted average of these per-fold costs, with each fold weighted by its share of the observations (n.s[i]/n); see my inline comments in the code of cv.glm:

    for (i in seq_len(ms)) {
        j.out <- seq_len(n)[(s == i)]
        j.in <- seq_len(n)[(s != i)]
        Call$data <- data[j.in, , drop = FALSE]
        d.glm <- eval.parent(Call)
        p.alpha <- n.s[i]/n #create weighted average for later
        cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
            , drop = FALSE], type = "response"))
        CV <- CV + p.alpha * cost.i # add weighted average error to running total
        cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
            data, type = "response"))
    }
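
    The weighted average can be reproduced by hand. The sketch below reuses the variable names from the loop above (s, n.s, j.out, p.alpha, CV); the mtcars data, the formula, K = 4 and the use of sample() for the fold labels are assumptions. It only covers the raw estimate in CV, not the cost.0 term.

    ```r
    cost <- function(y, yhat) mean((y - yhat)^2)

    set.seed(1)
    n <- nrow(mtcars)
    K <- 4                                       # divides 32, so no adjustment is needed
    s <- sample(rep(1L:K, ceiling(n / K)), n)    # random fold labels
    n.s <- tabulate(s, nbins = K)                # fold sizes

    CV <- 0
    sq_err <- numeric(n)                         # held-out squared errors, kept for comparison
    for (i in seq_len(K)) {
        j.out <- which(s == i)
        fit  <- glm(mpg ~ wt + hp, data = mtcars[-j.out, ])
        yhat <- predict(fit, mtcars[j.out, , drop = FALSE], type = "response")
        p.alpha <- n.s[i] / n                    # this fold's share of the data
        CV <- CV + p.alpha * cost(mtcars$mpg[j.out], yhat)
        sq_err[j.out] <- (mtcars$mpg[j.out] - yhat)^2
    }

    pooled <- mean(sq_err)    # plain mean over all held-out squared errors
    all.equal(CV, pooled)
    ```

    With the default cost, weighting each fold's mean squared error by n.s[i]/n gives the same number as averaging the held-out squared errors over all n observations, which is why CV is a mean squared error and not a sum.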
    