Cross validation for glm() models

后端未结

关注

 2  888

I\'m trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I\'m a little confused about the cv.glm() function in the boo


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  耶瑟儿～        
                
              
                            
                2021-01-31 12:23
              
            
            
                                                                       
I am always a little cautious about using various packages 10-fold cross validation methods.  I have my own simple script to create the test and training partitions manually for any machine learning package:

#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)

#Perform 10 fold cross validation
for(i in 1:10){
    #Segement your data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use test and train data partitions however you desire...
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  误落风尘        
                
              
                            
                2021-01-31 12:37
              
            
            
                                                                       
@Roman provided some answers in his comments, however, the answer to your questions is provided by inspecting the code with cv.glm:

I believe this bit of code splits the data set up randomly into the K-folds, arranging rounding as necessary if K does not divide n:

if ((K > n) || (K <= 1)) 
    stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0)) 
    K <- kvals[temp == min(temp)][1L]
if (K != K.o) 
    warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)


This bit here shows that the delta value is NOT the root mean square error. It is, as the helpfile says The default is the average squared error function. What does this mean? We can see this by inspecting the function declaration:

function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2), 
    K = n) 


which shows that within each fold, we calculate the average of the error squared, where error is in the usual sense between predicted response vs actual response.

delta[1] is simply the weighted average of the SUM of all of these terms for each fold, see my inline comments in the code of cv.glm: 

for (i in seq_len(ms)) {
    j.out <- seq_len(n)[(s == i)]
    j.in <- seq_len(n)[(s != i)]
    Call$data <- data[j.in, , drop = FALSE]
    d.glm <- eval.parent(Call)
    p.alpha <- n.s[i]/n #create weighted average for later
    cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
        , drop = FALSE], type = "response"))
    CV <- CV + p.alpha * cost.i # add weighted average error to running total
    cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
        data, type = "response"))
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复