Question
I am trying to run xgboost on a problem with very noisy features, and I am interested in stopping the boosting rounds based on a custom eval_metric that I have defined.
Based on domain knowledge, I know that when the eval_metric (evaluated on the training data) goes above a certain value, xgboost is overfitting. I would like to take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this?
It would be somewhat in line with the early stopping criteria, but not exactly.
Alternatively, is it possible to get the model from an intermediate round?
Here is an example to better explain my question (using the toy example that comes with the xgboost help docs and the default eval_metric):
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say that, from domain knowledge, I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction on a different dataset?
I need to run the training process over many different datasets, and I have no sense of how many rounds it might take to get the error below a fixed number, hence I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a certain number, I need to stop further training rounds.
Answer 1:
# Without the code you have tried or the data you are using, try something like this:
library(xgboost)
library(Metrics) # for rmse() to calculate errors
# Assume you have a training set db.train, a test set db.test, and some feature indices of interest
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# the response you are interested in
outcomeName <- "myLabel"
# you may also want to search over other parameters, e.g. eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds from 1 to 100, but set your own values
smallestError <- 100 # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train a fresh model with this depth and number of rounds
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max_depth = depth, nrounds = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic", verbose = TRUE)
    gc()
    # predict probabilities on the test set
    # (outputmargin = TRUE would return raw log-odds, which are not comparable to 0/1 labels)
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors])))
    err <- rmse(as.numeric(db.test[, outcomeName]), predictions)
    # report each new best (depth, rounds) combination
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
# You could adapt this code to your particular evaluation metric and print whatever suits your situation. Similarly, you could add a break to stop the loop once the number of rounds reaches the point where your condition is satisfied.
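As a rough sketch of the "break" idea applied to the toy data from the question: instead of retraining from scratch for every round count, one option is to add a single boosting round at a time and check the metric after each one. This assumes that xgb.train() accepts an existing booster through its xgb_model argument to continue training; argument names have shifted between xgboost R package versions, so check ?xgb.train for your install. my_metric, threshold, max_rounds and stopped_at are hypothetical names introduced here for illustration, and my_metric is only a stand-in for your custom metric.

```r
library(xgboost)

# Toy data from the question
data(agaricus.train, package = "xgboost")
labels <- agaricus.train$label
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = labels)

params <- list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2)

# Hypothetical stand-in for the custom metric: classification error at a 0.5 cutoff
my_metric <- function(preds, labels) mean(as.numeric(preds > 0.5) != labels)

threshold  <- 0.015  # stop as soon as the training metric drops below this
max_rounds <- 100    # safety cap so the loop always terminates

bst <- NULL
stopped_at <- NA_integer_
for (i in seq_len(max_rounds)) {
  # Add exactly one boosting round on top of the current model (NULL = start fresh)
  bst <- xgb.train(params = params, data = dtrain, nrounds = 1,
                   xgb_model = bst, verbose = 0)
  err <- my_metric(predict(bst, dtrain), labels)
  if (err < threshold) {
    stopped_at <- i
    break
  }
}

# bst is the model as of the stopping round (or after max_rounds if the
# threshold was never reached) and can be applied to other data
data(agaricus.test, package = "xgboost")
test_preds <- predict(bst, agaricus.test$data)
```

Because the loop keeps the booster from the round where the condition first holds, this also addresses the "model from an intermediate round" part of the question. If stopped_at is still NA after the loop, the threshold was never reached within max_rounds.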
Source: https://stackoverflow.com/questions/41816754/stop-xgboost-based-on-eval-metric