Does anyone know how gbm in R handles missing values? I can't seem to find any explanation using Google.
To explain what gbm does with missing predictors, let's first visualize a single tree of a gbm object. Suppose you have a gbm object mygbm. Using pretty.gbm.tree(mygbm, i.tree = 1) you can visualize the first tree of mygbm, e.g.:
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0       46  1.629728e+01        1         5           9      26.462908   1585 -4.396393e-06
1       45  1.850000e+01        2         3           4      11.363868    939 -4.370936e-04
2       -1  2.602236e-04       -1        -1          -1       0.000000    271  2.602236e-04
3       -1 -7.199873e-04       -1        -1          -1       0.000000    668 -7.199873e-04
4       -1 -4.370936e-04       -1        -1          -1       0.000000    939 -4.370936e-04
5       20  0.000000e+00        6         7           8       8.638042    646  6.245552e-04
6       -1  3.533436e-04       -1        -1          -1       0.000000    483  3.533436e-04
7       -1  1.428207e-03       -1        -1          -1       0.000000    163  1.428207e-03
8       -1  6.245552e-04       -1        -1          -1       0.000000    646  6.245552e-04
9       -1 -4.396393e-06       -1        -1          -1       0.000000   1585 -4.396393e-06
See the gbm documentation for details. Each row corresponds to a node, and the first (unnamed) column is the node number. We see that each node has a left and a right node (both set to -1 when the node is a leaf). We also see that each node has an associated MissingNode.
To run an observation down the tree, we start at node 0. If the observation has a missing value for SplitVar = 46, it is sent to the node given by MissingNode = 9. The tree's prediction for such an observation is SplitCodePred = -4.396393e-06, which is the same prediction the tree made before any split at node zero (Prediction = -4.396393e-06 for node zero).
The procedure is similar for other nodes and split variables.
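To make the routing rule concrete, here is a minimal sketch (my own code, not part of gbm) that walks a single observation down the data frame returned by pretty.gbm.tree. It assumes all split variables are continuous, with values below SplitCodePred going left; gbm encodes categorical splits differently, so treat this as illustrative only:

route_observation <- function(tree, x) {
  node <- 1                                # data frame row 1 is node 0
  repeat {
    split.var <- tree$SplitVar[node]
    if (split.var == -1)                   # -1 marks a terminal node
      return(tree$Prediction[node])
    value <- x[split.var + 1]              # SplitVar is 0-based
    node <- if (is.na(value)) {
      tree$MissingNode[node] + 1           # NA is sent to the MissingNode
    } else if (value < tree$SplitCodePred[node]) {
      tree$LeftNode[node] + 1
    } else {
      tree$RightNode[node] + 1
    }
  }
}

For example, route_observation(pretty.gbm.tree(mygbm, i.tree = 1), as.numeric(newdata[1, ])) reproduces the logic above for one observation, where newdata is a hypothetical data frame of predictors in training-column order.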
It appears to send missing values to a separate node within each tree. If you have a gbm object called "mygbm", then by typing "pretty.gbm.tree(mygbm, i.tree = 1)" you'll see that for each split in the tree there is a LeftNode, a RightNode, and a MissingNode. This implies that (assuming interaction.depth = 1) each tree will have 3 terminal nodes: one for each side of the split, and one for where the predictor is missing.
The gbm package in particular deals with NAs (missing values) as follows. The algorithm works by building and serially combining classification or regression trees. These so-called base-learner trees are built by divvying observations up into Left and Right splits (@user2332165 is right). There is also a separate Missing node type in gbm. If the row or observation does not have a value for that variable, the algorithm will apply a surrogate-split method.
If you want to understand surrogate splitting better, I recommend reading the rpart package vignette; a small example is sketched below.
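For contrast, here is a small rpart sketch (my own example, not from the vignette) showing surrogate splits, which rpart uses by default when the primary split variable is missing:

library(rpart)
d <- mtcars
d$wt[1:5] <- NA                        # make a strong predictor partially missing
fit <- rpart(mpg ~ wt + disp + hp, data = d)
summary(fit)                           # the per-node output lists "Surrogate splits"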
The official gbm guide introduces missing values into the test data, so I would assume the package is coded to handle them.
Start with the source code, then. Just typing gbm at the console shows you the source code:
function (formula = formula(data), distribution = "bernoulli",
    data = list(), weights, var.monotone = NULL, n.trees = 100,
    interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.001,
    bag.fraction = 0.5, train.fraction = 1, cv.folds = 0, keep.data = TRUE,
    verbose = TRUE)
{
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "weights", "offset"), names(mf), 0)
    mf <- mf[c(1, m)]
    mf$drop.unused.levels <- TRUE
    mf$na.action <- na.pass
    mf[[1]] <- as.name("model.frame")
    mf <- eval(mf, parent.frame())
    Terms <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- model.weights(mf)
    offset <- model.offset(mf)
    var.names <- attributes(Terms)$term.labels
    x <- model.frame(terms(reformulate(var.names)), data, na.action = na.pass)
    response.name <- as.character(formula[[2]])
    if (is.character(distribution))
        distribution <- list(name = distribution)
    cv.error <- NULL
    if (cv.folds > 1) {
        if (distribution$name == "coxph")
            i.train <- 1:floor(train.fraction * nrow(y))
        else i.train <- 1:floor(train.fraction * length(y))
        cv.group <- sample(rep(1:cv.folds, length = length(i.train)))
        cv.error <- rep(0, n.trees)
        for (i.cv in 1:cv.folds) {
            if (verbose)
                cat("CV:", i.cv, "\n")
            i <- order(cv.group == i.cv)
            gbm.obj <- gbm.fit(x[i.train, , drop = FALSE][i, , drop = FALSE],
                y[i.train][i], offset = offset[i.train][i],
                distribution = distribution,
                w = ifelse(w == NULL, NULL, w[i.train][i]),
                var.monotone = var.monotone, n.trees = n.trees,
                interaction.depth = interaction.depth,
                n.minobsinnode = n.minobsinnode, shrinkage = shrinkage,
                bag.fraction = bag.fraction,
                train.fraction = mean(cv.group != i.cv),
                keep.data = FALSE, verbose = verbose,
                var.names = var.names, response.name = response.name)
            cv.error <- cv.error + gbm.obj$valid.error * sum(cv.group == i.cv)
        }
        cv.error <- cv.error/length(i.train)
    }
    gbm.obj <- gbm.fit(x, y, offset = offset, distribution = distribution,
        w = w, var.monotone = var.monotone, n.trees = n.trees,
        interaction.depth = interaction.depth, n.minobsinnode = n.minobsinnode,
        shrinkage = shrinkage, bag.fraction = bag.fraction,
        train.fraction = train.fraction, keep.data = keep.data,
        verbose = verbose, var.names = var.names,
        response.name = response.name)
    gbm.obj$Terms <- Terms
    gbm.obj$cv.error <- cv.error
    gbm.obj$cv.folds <- cv.folds
    return(gbm.obj)
}
<environment: namespace:gbm>
A quick read suggests that the data is put into a model frame and that NAs are handled with na.pass, so in turn see ?na.pass. Reading that, it looks like it does nothing special with them, but you'd probably have to read up on the whole fitting process to see what that means in the long run. It looks like you might also need to look at the code of gbm.fit and so on.
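A quick console check confirms what ?na.pass says (a sketch):

df <- data.frame(x = c(1, NA, 3), y = c(2, 4, 6))
identical(na.pass(df), df)   # TRUE -- NAs are passed through unchanged
na.omit(df)                  # for contrast, this would drop the NA row

So rows with missing predictors survive the model-frame step and reach gbm.fit intact.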
Source: https://stackoverflow.com/questions/14718648/r-gbm-handling-of-missing-values