Why can't I use cv.glm on the output of bestglm?

问题

I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -

cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
    bestglm(Xy = winedata,
            family = binomial,          # binomial family for logistic
            IC = "AIC",                 # Information criteria
            method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)

However, this gives the error -

Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"

I thought that $BestModel is the lm-object that represents the best fit, and that's what manual also says. If that's the case, then why cant I find the test error on it using 10 fold CV, with the help of cv.glm?

The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and the package used is the boot package for cv.glm, and the bestglm package.

The data was processed as -

winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good"      #rename 'quality' to 'good'

回答1:

bestglm fit rearranges your data and name your response variable as y, hence if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that

It's always good to check what is the class:

class(res.best.logistic$BestModel)
[1] "glm" "lm"

But if you look at the call of res.best.logistic$BestModel:

res.best.logistic$BestModel$call

glm(formula = y ~ ., family = family, data = Xi, weights = weights)

head(res.best.logistic$BestModel$model)
  y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0           7.0             0.27        0.36           20.7     0.045
2 0           6.3             0.30        0.34            1.6     0.049
3 0           8.1             0.28        0.40            6.9     0.050
4 0           7.2             0.23        0.32            8.5     0.058
5 0           7.2             0.23        0.32            8.5     0.058
6 0           8.1             0.28        0.40            6.9     0.050
  free.sulfur.dioxide density   pH sulphates
1                  45  1.0010 3.00      0.45
2                  14  0.9940 3.30      0.49
3                  30  0.9951 3.26      0.44
4                  47  0.9956 3.19      0.40
5                  47  0.9956 3.19      0.40
6                  30  0.9951 3.26      0.44

You can substitute things in the call etc, but it's too much of a mess. Fitting is not costly, so make a fit on winedata and pass it to cv.glm:

best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
# take the variable names for best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
fit = glm(new_form,winedata,family="binomial")

best.cv.err<- cv.glm(winedata,fit,cost1, K=10)

来源：https://stackoverflow.com/questions/61025542/why-cant-i-use-cv-glm-on-the-output-of-bestglm

标签

machine-learning

logistic-regression

cross-validation