caret train() predicts very differently than predict.glm()

予麋鹿 2021-01-29 19:14

I am trying to estimate a logistic regression using 10-fold cross-validation.

#import libraries
library(car); library(caret); library(e1071); library(verif         


        
1 Answer
  • 2021-01-29 19:56

    You are trying to get an idea of the in-sample fit using a confusion matrix. Your first approach using the glm() function is fine.

    The problem with the second approach, using train(), lies in how you read the returned object. You are trying to extract the in-sample fitted values with fit$pred$pred. However, fit$pred is not aligned with chile.v or chile$vote. It holds the observed values and the held-out predictions for each of the 10 folds, stacked fold by fold, with rowIndex pointing back to the original row:

    > head(fit$pred)
      pred obs rowIndex parameter Resample
    1    N   N        2      none   Fold01
    2    Y   Y       20      none   Fold01
    3    Y   Y       28      none   Fold01
    4    N   N       38      none   Fold01
    5    N   N       55      none   Fold01
    6    N   N       66      none   Fold01
    > tail(fit$pred)
         pred obs rowIndex parameter Resample
    1698    Y   Y     1592      none   Fold10
    1699    Y   N     1594      none   Fold10
    1700    N   N     1621      none   Fold10
    1701    N   N     1656      none   Fold10
    1702    N   N     1671      none   Fold10
    1703    Y   Y     1689      none   Fold10 
    

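    So, because the rows are ordered by randomly assigned fold rather than by their original position, comparing fit$pred$pred element by element with chile.v pairs up unrelated observations. With a binary outcome, that yields an accuracy of roughly 50%.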
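
To check the alignment issue concretely, here is a minimal sketch on synthetic data. The data frame d and its columns x1, x2, and y are made up for illustration and are not the chile data. It assumes trainControl(savePredictions = TRUE), which is what populates fit$pred, and reorders the held-out predictions by rowIndex so that they line up with the original rows:

```r
library(caret)

# Synthetic stand-in for the chile data (hypothetical, for illustration only)
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(n) > 0, "Y", "N"),
              levels = c("N", "Y"))

# savePredictions = TRUE keeps the held-out predictions in fit$pred
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
fit  <- train(y ~ x1 + x2, data = d, method = "glm", trControl = ctrl)

# rowIndex maps each row of fit$pred back to a row of d
cv_pred <- fit$pred[order(fit$pred$rowIndex), ]

# After reordering, the stored observations match the original outcome
aligned <- all(as.character(cv_pred$obs) == as.character(d$y))

# Accuracy of the held-out predictions, computed on correctly aligned rows
cv_acc <- mean(as.character(cv_pred$pred) == as.character(d$y))
```

Note that cv_acc pools all held-out predictions, so it can differ slightly from the fold-averaged accuracy that print(fit) reports.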

    The in-sample fitted values you are looking for are in fit$finalModel$fitted.values. Using those:

    > fitpred <- fit$finalModel$fitted.values
    > fitpredt <- function(t) ifelse(fitpred > t, 1, 0)
    > confusionMatrix(fitpredt(0.3), chile.v)
    Confusion Matrix and Statistics
    
              Reference
    Prediction   0   1
             0 773  44
             1  94 792
    
                   Accuracy : 0.919          
                     95% CI : (0.905, 0.9315)
        No Information Rate : 0.5091         
        P-Value [Acc > NIR] : < 2.2e-16      
    
                      Kappa : 0.8381         
     Mcnemar's Test P-Value : 3.031e-05      
    
                Sensitivity : 0.8916         
                Specificity : 0.9474         
             Pos Pred Value : 0.9461         
             Neg Pred Value : 0.8939         
                 Prevalence : 0.5091         
             Detection Rate : 0.4539         
       Detection Prevalence : 0.4797         
          Balanced Accuracy : 0.9195         
    
           'Positive' Class : 0               
    

    Now the accuracy is in the expected range. Setting the threshold to 0.5 yields about the same accuracy as the 10-fold cross-validation estimate:

    > confusionMatrix(fitpredt(0.5),chile.v)
    Confusion Matrix and Statistics
    
              Reference
    Prediction   0   1
             0 809  64
             1  58 772
    
                   Accuracy : 0.9284          
                     95% CI : (0.9151, 0.9402)
    [rest of the output omitted]            
    
    > fit
    Generalized Linear Model 
    
    1703 samples
       7 predictors
       2 classes: 'N', 'Y' 
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    
    Summary of sample sizes: 1533, 1532, 1532, 1533, 1532, 1533, ... 
    
    Resampling results
    
      Accuracy  Kappa  Accuracy SD  Kappa SD
      0.927     0.854  0.0134       0.0267  
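
The same comparison can be reproduced on synthetic data. This sketch (again with a made-up data frame d, not the chile data) thresholds the in-sample fitted probabilities at 0.5 and places the result next to the resampled accuracy stored in fit$results. The helper classify() is hypothetical. It returns a factor with the same levels as the reference, which newer caret releases appear to require for confusionMatrix():

```r
library(caret)

set.seed(42)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 - d$x2 + rnorm(n) > 0, "Y", "N"),
              levels = c("N", "Y"))

fit <- train(y ~ x1 + x2, data = d, method = "glm",
             trControl = trainControl(method = "cv", number = 10))

# In-sample fitted probabilities of the second level ("Y"), in row order of d
p_hat <- fit$finalModel$fitted.values

# Hypothetical helper: threshold probabilities into a factor of class labels
classify <- function(p, t) {
  factor(ifelse(p > t, "Y", "N"), levels = c("N", "Y"))
}

cm <- confusionMatrix(classify(p_hat, 0.5), d$y, positive = "Y")

in_sample_acc <- unname(cm$overall["Accuracy"])
cv_acc        <- fit$results$Accuracy   # mean accuracy over the 10 folds

print(c(in_sample = in_sample_acc, cross_validated = cv_acc))
```

The two numbers are not expected to be identical, since one is measured on the training rows and the other on held-out rows, but for a simple model like this they should be close.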
    

    Additionally, regarding your expectation "that the cross validated results should not perform much worse than the first model," please check summary(res.chileIII) and summary(fit). After resampling, train() refits the final model on the full data set, so the fitted models and coefficients are exactly the same and give the same in-sample results.
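
A quick way to verify this, sketched on made-up data rather than the chile data (d, x1, x2, and y are illustrative names), is to fit the same formula with glm() and with train() and compare the coefficients directly:

```r
library(caret)

set.seed(7)
n <- 250
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(0.8 * d$x1 + d$x2 + rnorm(n) > 0, "Y", "N"),
              levels = c("N", "Y"))

# Plain logistic regression on the full data
glm_fit <- glm(y ~ x1 + x2, data = d, family = binomial)

# Same model through caret; resampling only affects the performance estimate
cv_fit <- train(y ~ x1 + x2, data = d, method = "glm",
                trControl = trainControl(method = "cv", number = 10))

# finalModel is refit on all rows, so the coefficients should agree
same_coef <- isTRUE(all.equal(unname(coef(glm_fit)),
                              unname(coef(cv_fit$finalModel))))

# ... and so should the in-sample fitted probabilities
same_fitted <- isTRUE(all.equal(unname(fitted(glm_fit)),
                                unname(cv_fit$finalModel$fitted.values)))
```

If both flags are TRUE, any difference between the two confusion matrices comes from how the predictions were extracted, not from the models themselves.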

    P.S. I know my answer is late, since this is quite an old question. Is it OK to answer these questions anyway? I am new here and did not find anything about "late answers" in the help.
