I am using the following geoadditive model
library(gamair)
library(mgcv)
data(mack)
mack$log.net.area <- log(mack$net.area)
## the choices of k, family and method here are assumptions; they are not guaranteed
## to reproduce the exact numbers shown in the outputs below
gm2 <- gam(egg.count ~ s(lon, lat, k = 100) + s(I(b.depth^0.5)) + s(c.dist) +
             s(temp.20m) + offset(log.net.area),
           data = mack, family = tw, method = "REML")
`predict` still requires all variables used in your model to be present in `newdata`, but you can pass in some arbitrary values, like 0s, for those covariates you don't have, then use `type = "terms"` and `terms = name_of_the_wanted_smooth_term` to proceed. Use
sapply(gm2$smooth, "[[", "label")
#[1] "s(lon,lat)" "s(I(b.depth^0.5))" "s(c.dist)"
#[4] "s(temp.20m)"
to check what smooth terms are in your model.
## new spatial locations to predict
newdat <- read.table(text = "lon lat
1 -3.00 44
4 -2.75 44
7 -2.50 44
10 -2.25 44
13 -2.00 44
16 -1.75 44")
## "garbage" values, just to pass the variable names checking in `predict.gam`
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0
## prediction on the link scale
pred_link <- predict(gm2, newdata = newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
## simplify to vector
pred_link <- attr(pred_link, "constant") + rowSums(pred_link)
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
## prediction on the response scale
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
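If you also want a measure of uncertainty for this term, `predict.gam` can return standard errors. Below is a minimal sketch, reusing `gm2` and `newdat` from above; the 2-standard-error band is an assumed, roughly 95% pointwise interval, and it only reflects the uncertainty of `s(lon,lat)` itself, not of the intercept.
## pointwise uncertainty of the spatial smooth (sketch, reusing `gm2` and `newdat`)
pr <- predict(gm2, newdata = newdat, type = "terms", terms = "s(lon,lat)", se.fit = TRUE)
fit <- attr(pr$fit, "constant") + rowSums(pr$fit)  ## intercept + s(lon,lat)
se <- pr$se.fit[, "s(lon,lat)"]                    ## standard error of the smooth only
## approximate 95% pointwise interval, back-transformed to the response scale
ci <- cbind(lower = gm2$family$linkinv(fit - 2 * se),
            upper = gm2$family$linkinv(fit + 2 * se))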
I don't normally use `predict.gam` if I want to do prediction for a specific smooth term. The logic of `predict.gam` is to do prediction for all terms first, that is, the same as specifying `type = "terms"`. Then:

- if `type = "link"`, it does a `rowSums` on all term-wise predictions plus an intercept (possibly with an offset);
- if `type = "terms"` and `terms` or `exclude` are unspecified, it returns the result as it is;
- if `type = "terms"` and you have specified `terms` and / or `exclude`, some post-processing is done to drop the terms you don't want and only give you those you want.

So, `predict.gam` will always do computation for all terms, even if you just want a single term.
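As a side note, for `type = "link"` or `type = "response"` you can get much the same effect with the `exclude` argument, which sets the named smooth terms to zero instead of dropping their columns. A small sketch, reusing `gm2` and `newdat` from above; it matches `pred_link` only because the offset variable `log.net.area` was set to 0 in `newdat`:
## zero out the other smooths instead of selecting one with `terms`
predict(gm2, newdata = newdat, type = "link",
        exclude = c("s(I(b.depth^0.5))", "s(c.dist)", "s(temp.20m)"))
Either way, `predict.gam` still evaluates every term internally before dropping or zeroing the ones you don't want.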
Knowing the inefficiency behind this, here is what I will do instead:
sm <- gm2$smooth[[1]] ## extract smooth construction info for `s(lon,lat)`
Xp <- PredictMat(sm, newdat) ## predictor matrix
b <- gm2$coefficients[with(sm, first.para:last.para)] ## coefficients for this term
pred_link <- c(Xp %*% b) + gm2$coef[[1]] ## this term + intercept
#[1] 0.5653381 0.6397377 0.9169403 1.4287511 1.7625325 1.8300665
pred_response <- gm2$family$linkinv(pred_link)
#[1] 1.760043 1.895983 2.501625 4.173484 5.827176 6.234301
You see, we get the same result.
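Standard errors can also be obtained on this manual route. A minimal sketch, continuing with `sm` and `Xp` from above and using the Bayesian covariance matrix `gm2$Vp`; the result should match what `predict.gam` reports for this term with `se.fit = TRUE`:
ind <- with(sm, first.para:last.para)  ## coefficient positions for `s(lon,lat)`
V <- gm2$Vp[ind, ind]                  ## their covariance block
se <- sqrt(rowSums((Xp %*% V) * Xp))   ## pointwise se: sqrt of diag(Xp V Xp')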
Won't the result depend on the values assigned to the covariates (here 0)?
Some garbage predictions will be made at those garbage values, but `predict.gam` discards them in the end.
Thanks, you are right. I am still not totally sure I understand why, then, there is the option to supply covariate values at the new locations.
Code maintenance is, as far as I feel, very difficult for a big package like mgcv. The code would need to be changed significantly if you wanted it to suit every user's need. Obviously, the `predict.gam` logic as I described here is inefficient when people, like you, just want a prediction for a certain smooth. In theory, in that case the variable-name checking on `newdata` could ignore the terms users don't want. But that would require a significant change to `predict.gam`, and could potentially introduce many bugs. Furthermore, you have to submit a changelog to CRAN, and CRAN may just not be happy to see such a drastic change.
Simon once shared his feelings: there are many people telling me I should write mgcv this way or that, but I simply can't. Yeah, give some sympathy to a package author / maintainer like him.
Thanks for the updated answer. However, I don't understand why the predictions don't depend on the values of the covariates at the new locations.
It will depend on them if you provide covariate values for `b.depth`, `c.dist`, `temp.20m` and `log.net.area`. But since you don't have them at the new locations, the prediction just assumes these effects to be 0.
OK, thanks, I see now! So would it be correct to say that, in the absence of covariate values at the new locations, I am only predicting the response from the spatial autocorrelation of the residuals?
You are only predicting the spatial field / smooth. In the GAM approach the spatial field is modeled as part of the mean, not of the variance-covariance structure (as in kriging), so I think your use of "residuals" is not correct here.
Yes, you are right. Just to understand what this code does: would it be correct to say that I am predicting how the response changes over space but not its actual values at the new locations (since for that I would need the values of the covariates at these locations)?
Correct. You can try `predict.gam` with or without `terms = "s(lon,lat)"` to help you digest the output. See how it changes when you vary the garbage values passed to the other covariates.
## a possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0
predict(gm2, newdat, type = "terms")
# s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1 -1.9881967 -1.05514 0.4739174 -1.466549
#4 -1.9137971 -1.05514 0.4739174 -1.466549
#7 -1.6365945 -1.05514 0.4739174 -1.466549
#10 -1.1247837 -1.05514 0.4739174 -1.466549
#13 -0.7910023 -1.05514 0.4739174 -1.466549
#16 -0.7234683 -1.05514 0.4739174 -1.466549
#attr(,"constant")
#(Intercept)
# 2.553535
predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
## another possible set of garbage values for covariates
newdat[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 1
predict(gm2, newdat, type = "terms")
# s(lon,lat) s(I(b.depth^0.5)) s(c.dist) s(temp.20m)
#1 -1.9881967 -0.9858522 -0.3749018 -1.269878
#4 -1.9137971 -0.9858522 -0.3749018 -1.269878
#7 -1.6365945 -0.9858522 -0.3749018 -1.269878
#10 -1.1247837 -0.9858522 -0.3749018 -1.269878
#13 -0.7910023 -0.9858522 -0.3749018 -1.269878
#16 -0.7234683 -0.9858522 -0.3749018 -1.269878
#attr(,"constant")
#(Intercept)
# 2.553535
predict(gm2, newdat, type = "terms", terms = "s(lon,lat)")
# s(lon,lat)
#1 -1.9881967
#4 -1.9137971
#7 -1.6365945
#10 -1.1247837
#13 -0.7910023
#16 -0.7234683
#attr(,"constant")
#(Intercept)
# 2.553535
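To see how the response changes over space at more than a handful of points, the same trick can be applied to a whole lon/lat grid. A sketch under the same assumptions as above; the 50 x 50 resolution and the use of `image` are just illustrative choices, and `plot.gam` or `vis.gam` offer ready-made alternatives:
## predict the spatial smooth on a regular grid covering the survey area
grid <- expand.grid(lon = seq(min(mack$lon), max(mack$lon), length.out = 50),
                    lat = seq(min(mack$lat), max(mack$lat), length.out = 50))
grid[c("b.depth", "c.dist", "temp.20m", "log.net.area")] <- 0  ## garbage values again
fit <- predict(gm2, newdata = grid, type = "terms", terms = "s(lon,lat)")
grid$spatial <- attr(fit, "constant") + rowSums(fit)           ## intercept + s(lon,lat)
## map the spatial field on the link scale
image(x = sort(unique(grid$lon)), y = sort(unique(grid$lat)),
      z = matrix(grid$spatial, nrow = 50), xlab = "lon", ylab = "lat")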