Why are there discrepancies in xgboost regression prediction from individual trees?

Submitted by 主宰稳场 on 2020-03-19 06:22:31

Question


First I fit a very simple XGBoost regression model that contains only 2 trees with 1 leaf each. The data are available here. (I understand this is a classification dataset, but I force a regression on it just to demonstrate the question):

import numpy as np
from numpy import loadtxt
from xgboost import XGBClassifier,XGBRegressor
from xgboost import plot_tree
import matplotlib.pyplot as plt

plt.rc('figure', figsize=[10,7])


# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on the training data
model = XGBRegressor(max_depth=0, learning_rate=0.1, n_estimators=2,random_state=123)
model.fit(X, y)

Plotting the trees, we see that the 2 trees give leaf values of -0.0150845 and -0.013578:

plot_tree(model, num_trees=0) # 1ST tree, gives -0.0150845
plot_tree(model, num_trees=1) # 2ND tree, gives -0.013578
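(As a side note, the same leaf values can also be read from the booster's text dump instead of the plots; a small sketch, assuming the sklearn wrapper's get_booster()/get_dump():)

for i, tree_dump in enumerate(model.get_booster().get_dump()):
    print(f"tree {i}: {tree_dump.strip()}")
# with 1-leaf trees, each dump is a single line like "0:leaf=-0.0150845"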

But if we run predictions with the 1st tree and both trees, they give reasonable values:

print(X[0])
print(model.predict(X[0,None],ntree_limit=1)) # 1st tree only
print(model.predict(X[0,None],ntree_limit=0)) # ntree_limit=0: use all trees

# output:
#[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
#[0.48491547]
#[0.47133744]

So there are two questions here:

  1. How do the trees' predictions "-0.0150845" and "-0.013578" relate to the final outputs "0.48491547" and "0.47133744"? Apparently there is some transformation going on here.
  2. If there is only 1 leaf for the trees, to minimize squared error (default objective of XGBRegressor), shouldn't the first tree predict just the sample mean of y which is ~0.3?

UPDATE: I figured out Q1: there is a base_score=0.5 default parameter in XGBRegressor which shifts the prediction (and which only really makes sense in a binary classification problem). But for Q2: even after I set base_score=0, the first leaf gives a value close to the sample mean of y, but not exactly equal to it. So there is still something missing here.
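A quick sanity check of the Q1 part (the leaf values below are simply read off the plotted trees above): the final prediction is base_score plus the sum of the already-shrunken leaf values.

base_score = 0.5                          # XGBRegressor default
leaf_1, leaf_2 = -0.0150845, -0.013578    # leaf values read off plot_tree above

print(base_score + leaf_1)                # ~0.4849155 -> matches ntree_limit=1
print(base_score + leaf_1 + leaf_2)       # ~0.4713375 -> matches prediction with all trees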


Answer 1:


This behavior is characteristic of gradient-boosted trees. The first tree carries the bulk of the prediction, and every later tree only corrects what is left over, so dropping the first tree dramatically reduces the performance of your model. The gradient-boosting algorithm looks roughly like this:
1. y_pred = base_score, pick a learning_rate (e.g. 0.1)
2. Repeat at train time, for each tree i:
i. residual = y - y_pred
ii. tree_i = fit a regression tree on (X, residual)
iii. y_pred = y_pred + learning_rate * tree_i.predict(X)
3. At test time:
prediction = base_score + sum over all i of learning_rate * tree_i.predict(X_test)
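A minimal runnable sketch of that loop in Python (using sklearn's DecisionTreeRegressor as a stand-in base learner, which is my assumption here; xgboost grows its own regularized trees, but the bookkeeping is the same):

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in base learner, not xgboost's tree

def boost(X, y, n_estimators=2, learning_rate=0.1, base_score=0.5):
    pred = np.full(len(y), base_score, dtype=float)   # start from the base score
    trees = []
    for _ in range(n_estimators):
        residual = y - pred                           # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        pred += learning_rate * tree.predict(X)       # each tree adds a shrunken correction
        trees.append(tree)
    return trees

def boosted_predict(trees, X_test, learning_rate=0.1, base_score=0.5):
    pred = np.full(len(X_test), base_score, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X_test)
    return pred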

Answer to your first question: the first tree predicts most of your data, and every subsequent tree only tries to reduce the error left by the trees before it. That is why the prediction looks reasonable using just the first tree, while each individual tree's leaf value on its own is only a small correction, not a standalone prediction. The gap between the two leaf values is just that residual correction.
Answer to your second question: not all frameworks initialize the prediction with the mean of your target values; many simply initialize it to 0 (or, in XGBoost's case, to base_score).
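For the concrete numbers in the question, a back-of-the-envelope check helps: for squared error, xgboost's leaf weight is sum(residuals) / (n + reg_lambda), with reg_lambda=1 by default, and the value stored in the tree is that weight multiplied by the learning rate. Under those assumptions, the plotted leaf values can be reproduced directly, which also shows why the first leaf is close to, but not exactly, the scaled residual mean:

from numpy import loadtxt
import numpy as np

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
y = dataset[:, 8]
n, lam, lr, base = len(y), 1.0, 0.1, 0.5      # defaults: reg_lambda=1, base_score=0.5

leaf1 = lr * np.sum(y - base) / (n + lam)     # first tree: residuals from base_score
print(leaf1)                                  # ~ -0.0150845, as in the first plotted tree

pred = base + leaf1                           # prediction after the first tree
leaf2 = lr * np.sum(y - pred) / (n + lam)     # second tree: residuals from updated prediction
print(leaf2)                                  # ~ -0.013578, as in the second plotted tree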
If you want to visualize gradient boosting, here's a good link: a YouTube video walking through the GBDT algorithm.
I hope this helps!



Source: https://stackoverflow.com/questions/56621607/why-are-there-discrepancies-in-xgboost-regression-prediction-from-individual-tre
