R gbm handling of missing values

匆匆过客 提交于 2019-12-03 09:48:09

问题


Does anyone know how gbm in R handles missing values? I can't seem to find any explanation using google.


回答1:


To explain what gbm does with missing predictors, let's first visualize a single tree of a gbm object.

Suppose you have a gbm object mygbm. Using pretty.gbm.tree(mygbm, i.tree=1) you can visualize the first tree on mygbm, e.g.:

  SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight    Prediction
0       46  1.629728e+01        1         5           9      26.462908   1585 -4.396393e-06
1       45  1.850000e+01        2         3           4      11.363868    939 -4.370936e-04
2       -1  2.602236e-04       -1        -1          -1       0.000000    271  2.602236e-04
3       -1 -7.199873e-04       -1        -1          -1       0.000000    668 -7.199873e-04
4       -1 -4.370936e-04       -1        -1          -1       0.000000    939 -4.370936e-04
5       20  0.000000e+00        6         7           8       8.638042    646  6.245552e-04
6       -1  3.533436e-04       -1        -1          -1       0.000000    483  3.533436e-04
7       -1  1.428207e-03       -1        -1          -1       0.000000    163  1.428207e-03
8       -1  6.245552e-04       -1        -1          -1       0.000000    646  6.245552e-04
9       -1 -4.396393e-06       -1        -1          -1       0.000000   1585 -4.396393e-06

See the gbm documentation for details. Each row corresponds to a node, and the first (unnamed) column is the node number. We see that each node has a left and right node (which are set to -1 in case the node is a leaf). We also see each node has associated a MissingNode.

To run an observation down the tree, we start at node 0. If an observation has a missing value on SplitVar = 46, then it will be sent down the tree to the node MissingNode = 9. The prediction of the tree for such observation will be SplitCodePred = -4.396393e-06, which is the same prediction the tree had before any split is made to node zero (Prediction = -4.396393e-06 for node zero).

The procedure is similar for other nodes and split variables.




回答2:


It appears to send missing values to a separate node within each tree. If you have a gbm object called "mygbm" then you'll see by typing "pretty.gbm.tree(mygbm, i.tree = 1)" that for each split in the tree there is a LeftNode a RightNode and a MissingNode. This implies that (assuming you have interaction.depth=1) each tree will have 3 terminal nodes (1 for each side of the split and one for where the predictor is missing).




回答3:


The gbm package in particular deals with NAs (missing values) as follows. The algorithm works by building and serially combining classification or regression trees. So-called base learner trees are built by divvying up observations into Left and Right splits (@user2332165 is right). There is also a separate node type of Missing in gbm. If the row or observation does not have a value for that variable, the algorithm will apply a surrogate split method.

If you want to understand surrogate splitting better, I recommend reading the package rpart vignette.




回答4:


The official guide to gbms introduces missing values to the test data, so I would assume that they are coded to handle missing values.




回答5:


Start with the source code then. Just typing gbm at the console shows you the source code:

function (formula = formula(data), distribution = "bernoulli", 
    data = list(), weights, var.monotone = NULL, n.trees = 100, 
    interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.001, 
    bag.fraction = 0.5, train.fraction = 1, cv.folds = 0, keep.data = TRUE, 
    verbose = TRUE) 
{
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "weights", "offset"), names(mf), 
        0)
    mf <- mf[c(1, m)]
    mf$drop.unused.levels <- TRUE
    mf$na.action <- na.pass
    mf[[1]] <- as.name("model.frame")
    mf <- eval(mf, parent.frame())
    Terms <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- model.weights(mf)
    offset <- model.offset(mf)
    var.names <- attributes(Terms)$term.labels
    x <- model.frame(terms(reformulate(var.names)), data, na.action = na.pass)
    response.name <- as.character(formula[[2]])
    if (is.character(distribution)) 
        distribution <- list(name = distribution)
    cv.error <- NULL
    if (cv.folds > 1) {
        if (distribution$name == "coxph") 
            i.train <- 1:floor(train.fraction * nrow(y))
        else i.train <- 1:floor(train.fraction * length(y))
        cv.group <- sample(rep(1:cv.folds, length = length(i.train)))
        cv.error <- rep(0, n.trees)
        for (i.cv in 1:cv.folds) {
            if (verbose) 
                cat("CV:", i.cv, "\n")
            i <- order(cv.group == i.cv)
            gbm.obj <- gbm.fit(x[i.train, , drop = FALSE][i, 
                , drop = FALSE], y[i.train][i], offset = offset[i.train][i], 
                distribution = distribution, w = ifelse(w == 
                  NULL, NULL, w[i.train][i]), var.monotone = var.monotone, 
                n.trees = n.trees, interaction.depth = interaction.depth, 
                n.minobsinnode = n.minobsinnode, shrinkage = shrinkage, 
                bag.fraction = bag.fraction, train.fraction = mean(cv.group != 
                  i.cv), keep.data = FALSE, verbose = verbose, 
                var.names = var.names, response.name = response.name)
            cv.error <- cv.error + gbm.obj$valid.error * sum(cv.group == 
                i.cv)
        }
        cv.error <- cv.error/length(i.train)
    }
    gbm.obj <- gbm.fit(x, y, offset = offset, distribution = distribution, 
        w = w, var.monotone = var.monotone, n.trees = n.trees, 
        interaction.depth = interaction.depth, n.minobsinnode = n.minobsinnode, 
        shrinkage = shrinkage, bag.fraction = bag.fraction, train.fraction = train.fraction, 
        keep.data = keep.data, verbose = verbose, var.names = var.names, 
        response.name = response.name)
    gbm.obj$Terms <- Terms
    gbm.obj$cv.error <- cv.error
    gbm.obj$cv.folds <- cv.folds
    return(gbm.obj)
}
<environment: namespace:gbm>

A quick read suggests that the data is put into a model frame and that NA's are handled with na.pass so in turn, ?na.pass Reading that, it looks like it does nothing special with them, but you'd probably have to read up on the whole fitting process to see what that means in the long run. Looks like you might need to also look at the code of gbm.fit and so on.



来源:https://stackoverflow.com/questions/14718648/r-gbm-handling-of-missing-values

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!