I am trying to estimate a logistic regression using 10-fold cross-validation.
#import libraries
library(car); library(caret); library(e1071); library(verif
You are trying to get an idea of the in-sample fit using a confusion matrix. Your first approach, using the glm() function, is fine.
The problem with the second approach, using train(), lies in the returned object. You are trying to extract the in-sample fitted values from it with fit$pred$pred. However, fit$pred does not contain fitted values that are aligned with chile.v or chile$vote. It contains the observed values and the held-out predictions from the 10 folds, stacked fold by fold:
> head(fit$pred)
pred obs rowIndex parameter Resample
1 N N 2 none Fold01
2 Y Y 20 none Fold01
3 Y Y 28 none Fold01
4 N N 38 none Fold01
5 N N 55 none Fold01
6 N N 66 none Fold01
> tail(fit$pred)
pred obs rowIndex parameter Resample
1698 Y Y 1592 none Fold10
1699 Y N 1594 none Fold10
1700 N N 1621 none Fold10
1701 N N 1656 none Fold10
1702 N N 1671 none Fold10
1703 Y Y 1689 none Fold10
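The rows of fit$pred are ordered by fold, not by their position in the data (the rowIndex column gives the original row). Comparing fit$pred$pred with chile.v element by element therefore pairs each prediction with an essentially random observation. Because the outcome is binary and the two classes are roughly balanced, you get an accuracy of roughly 50%.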
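To illustrate the misalignment, here is a minimal sketch in base R on simulated data (not the Chile data). It builds a table shaped like fit$pred by hand, so the names dat, fold and cv_pred are made up for illustration, and the manual fold loop only imitates what train() does internally.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(3 * x), "Y", "N"), levels = c("N", "Y"))
dat <- data.frame(x = x, y = y)

# Assign each row to one of 10 folds at random
fold <- sample(rep(1:10, length.out = n))

# Collect held-out predictions fold by fold, similar in shape to fit$pred
cv_pred <- do.call(rbind, lapply(1:10, function(k) {
  test_rows <- which(fold == k)
  m <- glm(y ~ x, data = dat[-test_rows, ], family = binomial)
  p <- predict(m, newdata = dat[test_rows, ], type = "response")
  data.frame(pred = factor(ifelse(p > 0.5, "Y", "N"), levels = c("N", "Y")),
             obs = dat$y[test_rows],
             rowIndex = test_rows)
}))

# Misaligned: fold-ordered predictions against the data in original order
acc_misaligned <- mean(cv_pred$pred == dat$y)

# Aligned: compare within the table, or reorder by rowIndex first
acc_within <- mean(cv_pred$pred == cv_pred$obs)
aligned <- cv_pred[order(cv_pred$rowIndex), ]
acc_aligned <- mean(aligned$pred == dat$y)

c(misaligned = acc_misaligned, within = acc_within, aligned = acc_aligned)
```

With a caret object, the same realignment should be fit$pred[order(fit$pred$rowIndex), ], or you can simply compare fit$pred$pred with fit$pred$obs. Either way these are cross-validated predictions, not in-sample fitted values.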
The in-sample fitted values you are looking for are in fit$finalModel$fitted.values. Using those:
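# In-sample fitted probabilities from the final model
fitpred <- fit$finalModel$fitted.values
# Classify as 1 when the fitted probability exceeds the threshold t
fitpredt <- function(t) ifelse(fitpred > t, 1, 0)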
> confusionMatrix(fitpredt(0.3),chile.v)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 773 44
1 94 792
Accuracy : 0.919
95% CI : (0.905, 0.9315)
No Information Rate : 0.5091
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8381
Mcnemar's Test P-Value : 3.031e-05
Sensitivity : 0.8916
Specificity : 0.9474
Pos Pred Value : 0.9461
Neg Pred Value : 0.8939
Prevalence : 0.5091
Detection Rate : 0.4539
Detection Prevalence : 0.4797
Balanced Accuracy : 0.9195
'Positive' Class : 0
Now the accuracy is around the expected value. Setting the threshold to 0.5 yields about the same in-sample accuracy (0.928) as the 10-fold cross-validation estimate (0.927):
> confusionMatrix(fitpredt(0.5),chile.v)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 809 64
1 58 772
Accuracy : 0.9284
95% CI : (0.9151, 0.9402)
[rest of the output omitted]
> fit
Generalized Linear Model
1703 samples
7 predictors
2 classes: 'N', 'Y'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1533, 1532, 1532, 1533, 1532, 1533, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.927 0.854 0.0134 0.0267
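If you want to see how the threshold moves the confusion matrix without relying on caret, the following is a small sketch in base R. The data are simulated and the helpers confusion_at() and accuracy() are hypothetical names introduced only for this example.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x))  # simulated 0/1 outcome

m <- glm(y ~ x, family = binomial)
p <- fitted(m)                                  # in-sample fitted probabilities

# Hypothetical helper: 2x2 table of predicted vs observed class at threshold t
confusion_at <- function(t) {
  pred <- factor(ifelse(p > t, 1, 0), levels = c(0, 1))
  table(Prediction = pred, Reference = factor(y, levels = c(0, 1)))
}

# Share of observations on the diagonal of the table
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

tab03 <- confusion_at(0.3)
tab05 <- confusion_at(0.5)
tab03
tab05
c(acc_0.3 = accuracy(tab03), acc_0.5 = accuracy(tab05))
```

Lowering the threshold from 0.5 to 0.3 can only move observations from predicted 0 to predicted 1, so it trades specificity for sensitivity. As far as I know, recent versions of caret expect confusionMatrix() to receive two factors with the same levels, so wrapping the thresholded values in factor() with explicit levels, as above, is a reasonable habit there too.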
Additionally, regarding your expectation "that the cross validated results should not perform much worse than the first model," please compare summary(res.chileIII) and summary(fit). The fitted models and coefficients are exactly the same, because train() refits the final model on the full data set, so they give the same in-sample results. The folds are only used to estimate out-of-sample performance.
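You can check the equality of the coefficients directly. Below is a minimal sketch with caret on simulated data. The data frame dat and the object names m_glm and fit_cv are assumptions for illustration; with your objects the comparison would be between coef(res.chileIII) and coef(fit$finalModel).

```r
library(caret)

set.seed(123)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(runif(n) < plogis(1.5 * dat$x1 - dat$x2), "Y", "N"),
                levels = c("N", "Y"))

# Plain logistic regression on all rows
m_glm <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Same model through caret with 10-fold cross-validation
fit_cv <- train(y ~ x1 + x2, data = dat, method = "glm",
                trControl = trainControl(method = "cv", number = 10))

# The folds only estimate performance; finalModel is refit on all rows
coef_diff <- max(abs(unname(coef(m_glm)) - unname(coef(fit_cv$finalModel))))
coef_diff
```

The difference should be zero up to numerical precision, which is why the in-sample confusion matrices of the two approaches agree.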
P.S. I know my answer is late, since this is quite an old question. Is it OK to answer such questions anyway? I am new here and did not find anything about "late answers" in the help.