predict.glm() with three new categories in the test data (r)(error)

倾然丶 夕夏残阳落幕 submitted on 2019-12-01 14:13:34
李哲源

In fixed-effects modelling, which includes linear models and generalized linear models, it is impossible to obtain estimates for new factor levels. glm (like lm) records which factor levels were present and used during model fitting; they can be found in testreg$xlevels.
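As a minimal sketch of how these recorded levels can be compared with the levels in new data, consider the toy example below. The data frames train and test, the variable g and the object fit are made up for illustration and are not the objects from the question.

```r
# Toy training data with a three-level factor (hypothetical values)
train <- data.frame(y = c(1.2, 0.7, 1.9, 2.4, 1.1, 2.0),
                    g = factor(c("a", "b", "c", "a", "b", "c")))
fit <- lm(y ~ g, data = train)

# Levels recorded at fitting time
fit$xlevels

# Toy prediction data containing two levels the model has never seen
test <- data.frame(g = factor(c("a", "d", "c", "e")))

# Levels present in the prediction data but absent from the fitted model
new_levels <- setdiff(levels(test$g), fit$xlevels$g)
new_levels
```

Any level listed in new_levels has no coefficient in the fitted model, which is why predict refuses to handle it.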

Your model formula for model estimation is:

returnShipment ~ size + color + price + manufacturerID + salutation + 
                 state + age + deliverytime

Then predict complains about new factor levels 125, 136, 137 for manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficients for prediction. In this case, we have to drop this factor variable and use a prediction formula:

returnShipment ~ size + color + price + salutation + 
                 state + age + deliverytime

However, the standard predict routine cannot take a customized prediction formula. There are two common solutions:

  1. extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests;
  2. reset the factor in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way, we can still use the standard predict for prediction.
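The first solution could be sketched roughly as follows on simulated data. The names train, test, x, g and fit are hypothetical, and the sketch assumes that the column names of the reduced design matrix match the coefficient names of the fitted model, which holds for simple main-effect terms like these.

```r
# Simulated training data: one numeric covariate, one factor (assumed toy setup)
set.seed(1)
train <- data.frame(
  x = rnorm(60),
  g = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
train$y <- rbinom(60, 1, plogis(0.5 * train$x))
fit <- glm(y ~ x + g, family = binomial(link = "logit"), data = train)

# Prediction data with unseen levels "d" and "e"
test <- data.frame(x = c(-1, 0, 1), g = factor(c("a", "d", "e")))

# Design matrix for the terms we keep (everything except g)
X <- model.matrix(~ x, data = test)

# Pick the matching coefficients by name, then multiply
beta <- coef(fit)[colnames(X)]
eta <- drop(X %*% beta)           # linear predictor without the g term
prob <- fit$family$linkinv(eta)   # back to the probability scale
```

Dropping the factor term this way gives the same linear predictor as setting every row to the reference level of g.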

For the second solution, let's first pick a factor level that was used for model fitting:

xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]

Then we assign this level to your prediction data:

replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement

And we are ready to predict:

pred <- predict(testreg, test, type = "link")  ## don't use type = "response" here!!

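In the end, we adjust this linear predictor by subtracting the estimated coefficient of that factor level:

est <- coef(testreg)[paste0("manufacturerID", mID125)]  ## the coefficient name is a string
if (is.na(est)) est <- 0  ## the reference level has no coefficient under treatment contrasts
pred <- pred - est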

Finally, if you want predictions on the original scale, apply the inverse of the link function:

testreg$family$linkinv(pred)
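Putting these steps together, a small helper might look like the sketch below. The function name predict_without_factor and the toy data are hypothetical, and the coefficient lookup assumes R's default treatment contrasts, where coefficients are named by pasting the variable name and the level.

```r
# Simulated data (assumed toy setup, not the data from the question)
set.seed(1)
train <- data.frame(
  x = rnorm(60),
  g = factor(sample(c("a", "b", "c"), 60, replace = TRUE))
)
train$y <- rbinom(60, 1, plogis(0.5 * train$x))
fit <- glm(y ~ x + g, family = binomial(link = "logit"), data = train)

# Hypothetical helper: predict while ignoring the contribution of factor `fac`
predict_without_factor <- function(model, newdata, fac) {
  lev <- model$xlevels[[fac]]
  # overwrite the factor with one level known to the model
  newdata[[fac]] <- factor(rep(lev[1], nrow(newdata)), levels = lev)
  eta <- predict(model, newdata, type = "link")
  # subtract that level's coefficient; the reference level has none (lookup gives NA)
  est <- coef(model)[paste0(fac, lev[1])]
  if (!is.na(est)) eta <- eta - est
  model$family$linkinv(eta)
}

test <- data.frame(x = c(-1, 0, 1), g = factor(c("a", "d", "e")))
p <- predict_without_factor(fit, test, "g")
```

Here p holds probabilities computed from the intercept and x only, so the unseen levels "d" and "e" no longer cause an error.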

Update:

You said you ran into various problems when trying the above solutions. Here is why.

Your code:

testreg <- glm(train$returnShipment~ train$size + train$color + 
               train$price + train$manufacturerID + train$salutation + 
               train$state + train$age + train$deliverytime,
               family=binomial(link="logit"), data=train)

is a very bad way to specify your model formula. Writing train$returnShipment, etc., hard-codes the data frame train into the formula, so variables are always looked up in train, and you will have trouble later when predicting with other data sets, like test.

As a simple example of this drawback, we simulate some toy data and fit a GLM:

set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo)  ## bad style

> toy$formula
foo$y ~ foo$a  
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"

Now we see that everything comes with the prefix foo$. During prediction:

newdata <- foo[1:2, ]  ## take first 2 rows of "foo" as "newdata"
rm(foo)  ## remove "foo" from R session
predict(toy, newdata)

we get an error:

Error in eval(expr, envir, enclos) : object 'foo' not found

The good style is to let the function look up the variables through its data argument:

foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)

Then the foo$ prefix goes away:

> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
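To check that the cleaner formula also fixes prediction, the earlier experiment can be repeated as a short sketch, reusing the same toy objects y, a, foo and toy from above:

```r
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)  ## good style

newdata <- foo[1:2, ]  ## take first 2 rows of "foo" as "newdata"
rm(foo)                ## remove "foo" from R session
p <- predict(toy, newdata)  ## variables are now looked up in "newdata"
```

This time predict does not need foo, because the formula refers to plain variable names that are found in newdata.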

This would explain two things:

  1. You complained to me in the comment that when you do testreg$xlevels$manufacturerID, you get NULL (the entry is actually stored under the name train$manufacturerID);
  2. The prediction error you posted

    Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
    Factor 'train$manufacturerID' has new levels 125, 136, 137

    complains about train$manufacturerID instead of test$manufacturerID.

Because you split your train and test samples by row number, some factor levels of your variables are not represented in both samples.

You need to do stratified sampling to ensure that both the train and test samples contain every factor level. Use stratified from the splitstackshape package.
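For illustration only, the idea of a stratified split can also be sketched in base R. This is a swapped-in alternative rather than the splitstackshape function itself; the data frame dat, its columns and the 70% training fraction are all assumptions.

```r
# Toy data: one common manufacturer and three rare ones (hypothetical values)
dat <- data.frame(
  id = 1:100,
  manufacturerID = factor(rep(c("101", "125", "136", "137"),
                              times = c(70, 10, 10, 10)))
)

set.seed(42)
# Row indices grouped by factor level
idx_by_level <- split(seq_len(nrow(dat)), dat$manufacturerID)

# Within each level, draw 70% of the rows (at least one) for training
train_idx <- unlist(lapply(idx_by_level, function(i) {
  i[sample.int(length(i), size = max(1, floor(0.7 * length(i))))]
}))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Every level that occurs in test also occurs in train
all(unique(as.character(test$manufacturerID)) %in%
      as.character(train$manufacturerID))
```

Sampling within each level guarantees that every level seen in test is also present in train, so the fitted model has a coefficient for it.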
