Question
I am using the gbm package in R with the 'bernoulli' distribution option to build a classifier, and I get unusual results of 'nan' and am unable to predict any classification results. I do not encounter the same errors when I use 'adaboost'. Below is the sample code; I replicated the same errors with the iris dataset.
## using the iris data for gbm
library(caret)
library(gbm)
data(iris)
Data <- iris[1:100,-5]
Label <- as.factor(c(rep(0,50), rep(1,50)))
# Split the data into training and testing
inTraining <- createDataPartition(Label, p=0.7, list=FALSE)
training <- Data[inTraining, ]
trainLab <- droplevels(Label[inTraining])
testing <- Data[-inTraining, ]
testLab <- droplevels(Label[-inTraining])
# Model
model_gbm <- gbm.fit(x=training, y= trainLab,
distribution = "bernoulli",
n.trees = 20, interaction.depth = 1,
n.minobsinnode = 10, shrinkage = 0.001,
bag.fraction = 0.5, keep.data = TRUE, verbose = TRUE)
## output on the console
Iter TrainDeviance ValidDeviance StepSize Improve
1 -nan -nan 0.0010 -nan
2 nan -nan 0.0010 nan
3 -nan -nan 0.0010 -nan
4 nan -nan 0.0010 nan
5 -nan -nan 0.0010 -nan
6 nan -nan 0.0010 nan
7 -nan -nan 0.0010 -nan
8 nan -nan 0.0010 nan
9 -nan -nan 0.0010 -nan
10 nan -nan 0.0010 nan
20 nan -nan 0.0010 nan
Please let me know if there is a workaround to get this working. The reason I am using this is to experiment with Additive Logistic Regression; please suggest any other alternatives in R for doing this.
Thanks.
Answer 1:
Is there a reason you are using gbm.fit() instead of gbm()?
Based on the package documentation, the y variable in gbm.fit() needs to be a vector.
I tried making sure the vector was forced using
trainLab <- as.vector(droplevels(Label[inTraining])) #vector of chars
This gave the following output on the console. Unfortunately I'm not sure why the validation deviance is still -nan.
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.3843 -nan 0.0010 0.0010
2 1.3823 -nan 0.0010 0.0010
3 1.3803 -nan 0.0010 0.0010
4 1.3783 -nan 0.0010 0.0010
5 1.3763 -nan 0.0010 0.0010
6 1.3744 -nan 0.0010 0.0010
7 1.3724 -nan 0.0010 0.0010
8 1.3704 -nan 0.0010 0.0010
9 1.3684 -nan 0.0010 0.0010
10 1.3665 -nan 0.0010 0.0010
20 1.3471 -nan 0.0010 0.0010
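For completeness, here is a minimal sketch of one way to get finite deviances and predictions, assuming that the bernoulli distribution expects a numeric 0/1 response (the as.numeric(as.character(...)) conversion and the 0.5 cutoff below are illustrative choices, not part of the original answer):
# Sketch: refit with a numeric 0/1 response and predict on the held-out set.
# Assumes the training/testing/trainLab/testLab objects from the question exist.
trainLab_num <- as.numeric(as.character(trainLab))  # factor "0"/"1" -> numeric 0/1
testLab_num  <- as.numeric(as.character(testLab))
model_gbm <- gbm.fit(x = training, y = trainLab_num,
                     distribution = "bernoulli",
                     n.trees = 20, interaction.depth = 1,
                     n.minobsinnode = 10, shrinkage = 0.001,
                     bag.fraction = 0.5, keep.data = TRUE, verbose = TRUE)
# Predicted probabilities of class "1"; 0.5 is an arbitrary cutoff for illustration.
probs <- predict(model_gbm, newdata = testing, n.trees = 20, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
table(predicted = preds, actual = testLab_num)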
Answer 2:
train.fraction should be < 1 to get a ValidDeviance value, because that is how a validation dataset is created.
Thanks!
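As a minimal sketch of that suggestion, assuming the gbm() formula interface with a numeric 0/1 response (the train_df object name and the 0.8 fraction are illustrative):
# gbm() uses the first train.fraction rows for training and the rest for validation,
# so shuffle the rows first if they are ordered by class.
train_df <- data.frame(training, y = as.numeric(as.character(trainLab)))
train_df <- train_df[sample(nrow(train_df)), ]
model_gbm2 <- gbm(y ~ ., data = train_df,
                  distribution = "bernoulli",
                  n.trees = 20, interaction.depth = 1,
                  n.minobsinnode = 10, shrinkage = 0.001,
                  bag.fraction = 0.5, train.fraction = 0.8,
                  verbose = TRUE)
# ValidDeviance is now computed on the held-out 20% instead of being printed as -nan.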
Source: https://stackoverflow.com/questions/23530165/gradient-boosting-using-gbm-in-r-with-distribution-bernoulli