Does anyone know how gbm in R handles missing values? I can't seem to find any explanation using Google.
To explain what gbm does with missing predictors, let's first visualize a single tree of a gbm object. Suppose you have a gbm object mygbm. Using pretty.gbm.tree(mygbm, i.tree = 1) you can visualize the first tree of mygbm, e.g.:
  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0       46  1.629728e+01        1         5           9      26.462908   1585 -4.396393e-06
1       45  1.850000e+01        2         3           4      11.363868    939 -4.370936e-04
2       -1  2.602236e-04       -1        -1          -1       0.000000    271  2.602236e-04
3       -1 -7.199873e-04       -1        -1          -1       0.000000    668 -7.199873e-04
4       -1 -4.370936e-04       -1        -1          -1       0.000000    939 -4.370936e-04
5       20  0.000000e+00        6         7           8       8.638042    646  6.245552e-04
6       -1  3.533436e-04       -1        -1          -1       0.000000    483  3.533436e-04
7       -1  1.428207e-03       -1        -1          -1       0.000000    163  1.428207e-03
8       -1  6.245552e-04       -1        -1          -1       0.000000    646  6.245552e-04
9       -1 -4.396393e-06       -1        -1          -1       0.000000   1585 -4.396393e-06
See the gbm documentation for details. Each row corresponds to a node, and the first (unnamed) column is the node number. We see that each node has a left and a right node (both set to -1 when the node is a leaf). We also see that each node has an associated MissingNode.
To run an observation down the tree, we start at node 0. If the observation has a missing value for SplitVar = 46, it is sent to the node given by MissingNode = 9. The tree's prediction for such an observation is SplitCodePred = -4.396393e-06, which is the same prediction the tree made before any split at node zero (Prediction = -4.396393e-06 for node zero).
The procedure is similar for other nodes and split variables.
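To make the routing rule concrete, here is a minimal sketch (my own code, not part of gbm) that walks a single observation down the data frame returned by pretty.gbm.tree. It assumes all split variables are continuous, with values below SplitCodePred going left; gbm encodes categorical splits differently, so treat this as illustrative only:

route_observation <- function(tree, x) {
  node <- 1                                # data frame row 1 is node 0
  repeat {
    split.var <- tree$SplitVar[node]
    if (split.var == -1)                   # -1 marks a terminal node
      return(tree$Prediction[node])
    value <- x[split.var + 1]              # SplitVar is 0-based
    node <- if (is.na(value)) {
      tree$MissingNode[node] + 1           # NA is sent to the MissingNode
    } else if (value < tree$SplitCodePred[node]) {
      tree$LeftNode[node] + 1
    } else {
      tree$RightNode[node] + 1
    }
  }
}

For example, route_observation(pretty.gbm.tree(mygbm, i.tree = 1), as.numeric(newdata[1, ])) reproduces the logic above for one observation, where newdata is a hypothetical data frame of predictors in training-column order.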
It appears to send missing values to a separate node within each tree. If you have a gbm object called "mygbm", then by typing "pretty.gbm.tree(mygbm, i.tree = 1)" you'll see that for each split in the tree there is a LeftNode, a RightNode, and a MissingNode. This implies that (assuming interaction.depth = 1) each tree will have 3 terminal nodes: one for each side of the split, and one for where the predictor is missing.
The gbm package in particular deals with NAs (missing values) as follows. The algorithm works by building and serially combining classification or regression trees. These so-called base-learner trees are built by divvying observations up into Left and Right splits (@user2332165 is right). There is also a separate Missing node type in gbm. If the row or observation does not have a value for that variable, the algorithm will apply a surrogate-split method.
If you want to understand surrogate splitting better, I recommend reading the rpart package vignette; a small example is sketched below.
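For contrast, here is a small rpart sketch (my own example, not from the vignette) showing surrogate splits, which rpart uses by default when the primary split variable is missing:

library(rpart)
d <- mtcars
d$wt[1:5] <- NA                        # make a strong predictor partially missing
fit <- rpart(mpg ~ wt + disp + hp, data = d)
summary(fit)                           # the per-node output lists "Surrogate splits"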
The official gbm guide introduces missing values into the test data, so I would assume the package is coded to handle them.
Start with the source code, then. Just typing gbm at the console shows you the source code:
function (formula = formula(data), distribution = "bernoulli",
    data = list(), weights, var.monotone = NULL, n.trees = 100,
    interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.001,
    bag.fraction = 0.5, train.fraction = 1, cv.folds = 0, keep.data = TRUE,
    verbose = TRUE)
{
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "weights", "offset"), names(mf), 0)
    mf <- mf[c(1, m)]
    mf$drop.unused.levels <- TRUE
    mf$na.action <- na.pass
    mf[[1]] <- as.name("model.frame")
    mf <- eval(mf, parent.frame())
    Terms <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- model.weights(mf)
    offset <- model.offset(mf)
    var.names <- attributes(Terms)$term.labels
    x <- model.frame(terms(reformulate(var.names)), data, na.action = na.pass)
    response.name <- as.character(formula[[2]])
    if (is.character(distribution))
        distribution <- list(name = distribution)
    cv.error <- NULL
    if (cv.folds > 1) {
        if (distribution$name == "coxph")
            i.train <- 1:floor(train.fraction * nrow(y))
        else i.train <- 1:floor(train.fraction * length(y))
        cv.group <- sample(rep(1:cv.folds, length = length(i.train)))
        cv.error <- rep(0, n.trees)
        for (i.cv in 1:cv.folds) {
            if (verbose)
                cat("CV:", i.cv, "\n")
            i <- order(cv.group == i.cv)
            gbm.obj <- gbm.fit(x[i.train, , drop = FALSE][i, , drop = FALSE],
                y[i.train][i], offset = offset[i.train][i],
                distribution = distribution,
                w = ifelse(w == NULL, NULL, w[i.train][i]),
                var.monotone = var.monotone, n.trees = n.trees,
                interaction.depth = interaction.depth,
                n.minobsinnode = n.minobsinnode, shrinkage = shrinkage,
                bag.fraction = bag.fraction,
                train.fraction = mean(cv.group != i.cv),
                keep.data = FALSE, verbose = verbose,
                var.names = var.names, response.name = response.name)
            cv.error <- cv.error + gbm.obj$valid.error * sum(cv.group == i.cv)
        }
        cv.error <- cv.error/length(i.train)
    }
    gbm.obj <- gbm.fit(x, y, offset = offset, distribution = distribution,
        w = w, var.monotone = var.monotone, n.trees = n.trees,
        interaction.depth = interaction.depth, n.minobsinnode = n.minobsinnode,
        shrinkage = shrinkage, bag.fraction = bag.fraction,
        train.fraction = train.fraction, keep.data = keep.data,
        verbose = verbose, var.names = var.names,
        response.name = response.name)
    gbm.obj$Terms <- Terms
    gbm.obj$cv.error <- cv.error
    gbm.obj$cv.folds <- cv.folds
    return(gbm.obj)
}
<environment: namespace:gbm>
A quick read suggests that the data is put into a model frame and that NAs are handled with na.pass, so in turn see ?na.pass. Reading that, it looks like it does nothing special with them, but you'd probably have to read up on the whole fitting process to see what that means in the long run. It looks like you might also need to look at the code of gbm.fit and so on.
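A quick console check confirms what ?na.pass says (a sketch):

df <- data.frame(x = c(1, NA, 3), y = c(2, 4, 6))
identical(na.pass(df), df)   # TRUE -- NAs are passed through unchanged
na.omit(df)                  # for contrast, this would drop the NA row

So rows with missing predictors survive the model-frame step and reach gbm.fit intact.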
Source: https://stackoverflow.com/questions/14718648/r-gbm-handling-of-missing-values