The gradient boosting decision tree (GBDT) is one of the best performing classes of algorithms in machine learning competitions. One implementation of the gradient boosting decision tree – xgboost – is one of the most popular algorithms on Kaggle. Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 used xgboost. If you take a look at the kernels in a Kaggle competition, you can clearly see how popular xgboost is.
The search results for all kernels that had xgboost in their titles for the Kaggle Quora Duplicate Question Detection competition. There are a lot more if you scroll down, and there are plenty of kernels that use xgboost but do not mention that in their titles.
Though xgboost seemed to be the go-to algorithm in Kaggle for a while, a new contender is quickly gaining traction: lightGBM. Released by Microsoft, this algorithm has been claimed to be more efficient (better predictive performance for the same running time) than xgboost.
This post delves into the details of both xgboost and lightGBM and what makes them so effective. By understanding the underlying algorithms, it should be easier to understand what each parameter means, which will make it easier to conduct effective hyperparameter tuning.
As a word of caution, this post does not aim to provide a conclusion as to which algorithm is superior. The superiority of each algorithm is probably dependent on the data, and if you are wondering which to use, I would honestly recommend trying out both and comparing the results. It’s not that time consuming or difficult. Furthermore, both xgboost and lightGBM are constantly being updated, so some features that were previously exclusive to lightGBM are being incorporated into xgboost. That makes it all the more fruitful to focus on both libraries without focusing too much on their difference and relative superiority.
0. Background: Gradient Boosting Decision Trees
If you are already familiar with the concept of Gradient Boosting Decision Trees (which we will henceforth refer to as GBDTs), you can skip to the next section.
For those that are not, let me give a brief and minimal overview.
In order to understand GBDTs, we need to understand the decision tree. Decision trees are a method of splitting the data based on features to either classify or predict some value. Each split in a decision tree divides the data into two (or more, if the tree is not binary) groups. Each leaf node is assigned a single label (a class or a predicted value). When predicting with a decision tree, each data point is routed to the appropriate leaf node, and the prediction is the label of that leaf node.
A simple decision tree for predicting whether a person will buy a computer. In this case, the model would predict that a young student would buy a computer, whereas a senior without an excellent credit rating would not.
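For illustration only, the tree in the figure could be written as nested conditionals. The feature names and the middle-aged branch here are my reading of the figure, so treat this purely as a toy sketch:

```python
def predict_buys_computer(age: str, is_student: bool, credit_rating: str) -> bool:
    """Toy version of the decision tree in the figure (hypothetical features)."""
    if age == "young":
        return is_student                       # young students buy, others don't
    elif age == "middle-aged":
        return True                             # assumed branch of the figure
    else:                                       # senior
        return credit_rating == "excellent"

print(predict_buys_computer("young", True, "fair"))    # True: a young student buys
print(predict_buys_computer("senior", False, "fair"))  # False: no excellent rating
```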
Decision trees are flexible and interpretable. However, a single decision tree is prone to overfitting and is unlikely to generalize well. There are various ways of restricting the flexibility of a decision tree, such as by limiting its depth, but those methods then cause the decision tree to underfit. This is why decision trees are generally not used alone: instead, multiple decision trees are used together. Gradient boosting decision trees are one method (among many) of combining the predictions of multiple decision trees to make predictions that generalize well.
Despite their strength, the idea behind GBDTs is very simple: combine the predictions of multiple decision trees by adding them together. For instance, if we were trying to predict housing prices, the predicted price for any data point would be the sum of the predictions of each individual decision tree.
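To make the additive idea concrete, here is a toy sketch (plain Python, not any library's API) in which the prediction of a tiny housing-price ensemble is simply the sum of its trees' outputs:

```python
# Each "tree" is stubbed out as a function mapping features to a correction term.
trees = [
    lambda x: 200_000.0,                                    # coarse base estimate
    lambda x: 50_000.0 if x["rooms"] > 3 else -20_000.0,    # correction for size
    lambda x: 30_000.0 if x["near_station"] else 0.0,       # correction for location
]

def predict(x):
    # A GBDT's prediction is the sum of every tree's output.
    return sum(tree(x) for tree in trees)

print(predict({"rooms": 4, "near_station": True}))  # 280000.0
```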
GBDTs are trained iteratively – i.e. one tree at a time. For instance, a GBDT that attempts to predict housing prices would first train a simple, weak decision tree on the data and raw housing prices. The decision tree is trained to minimize a loss function – such as the mean squared error – by recursively splitting the data in a way that maximizes some criterion until some limit – such as the depth of the tree – is met. The criterion is chosen so that the loss function is (approximately) minimized by each split. One commonly used criterion is the Gini index, which measures the “impurity” of a leaf node in the case of binary classification.
The recursive process of training a decision tree
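As a concrete example, here is a minimal sketch of the Gini impurity mentioned above for a node containing binary labels; this is my own illustration, not library code:

```python
def gini_impurity(labels):
    """Gini impurity of a node holding binary labels (0/1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)   # fraction of positive samples in the node
    return 2 * p * (1 - p)          # 0 for a pure node, 0.5 for a 50/50 node

print(gini_impurity([1, 1, 1, 1]))  # 0.0: the node is pure
print(gini_impurity([0, 1, 0, 1]))  # 0.5: the node is maximally impure
```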
The next tree is then trained to minimize the loss function when its outputs are added to the first tree. This is (approximately) achieved by recursively splitting the data according to a new criterion. Though the details are beyond the scope of this post, the criterion can be simply calculated for any split of data based on the gradient statistics (the value of the gradient for each data point).
The important thing to note here is that computing the best split requires the model to go through various splits and compute the criterion for each split. There is no analytical solution for determining the best split at each stage. As we will see later, this is one of the key challenges when training GBDTs.
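As a rough illustration of what such a criterion looks like, the sketch below follows the regularized gain formula from the xgboost paper; the variable names are mine, and lam stands in for xgboost's L2 regularization term lambda:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a candidate split, computed from summed gradient statistics.

    g_*: sum of first-order gradients of the loss over the samples on each side
    h_*: sum of second-order gradients (Hessians) over the samples on each side
    """
    def score(g, h):
        return g * g / (h + lam)
    # Gain = score(left child) + score(right child) - score(parent), up to
    # a constant factor and a complexity penalty that is omitted here.
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
```

The split with the largest gain is the one chosen, which is why the search has to evaluate the criterion for many candidate splits.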
1. The Motivation behind xgboost and lightGBM
Xgboost and lightGBM are not the only implementations of GBDTs. The reason they are treated as the representative GBDT implementations is that they
(1) both have easy-to-use open source implementations
(2) are fast and accurate.
Point (2) is particularly important: although GBDTs are also implemented in sklearn, that implementation is much slower than xgboost and lightGBM.
The difference between xgboost and lightGBM is in the specifics of the optimizations. Below, we will go through the various ways in which xgboost and lightGBM improve upon the basic idea of GBDTs to train accurate models efficiently.
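As noted earlier, trying out both libraries is straightforward, since each provides a scikit-learn compatible wrapper. The sketch below is only illustrative: the data is a random placeholder and the hyperparameter values are not recommendations.

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

# Placeholder data; substitute your own dataset here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = 2 * X[:, 0] + rng.normal(size=1000)

xgb_model = xgb.XGBRegressor(n_estimators=100, max_depth=6)
lgb_model = lgb.LGBMRegressor(n_estimators=100, num_leaves=31)

xgb_model.fit(X, y)
lgb_model.fit(X, y)

print(xgb_model.predict(X[:5]))
print(lgb_model.predict(X[:5]))
```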
2. Growing the Tree
When training each individual decision tree and splitting the data, there are two strategies that can be employed: level-wise and leaf-wise. LightGBM always grows its trees leaf-wise, whereas xgboost traditionally grows them level-wise (though, as we will see below, it now supports leaf-wise growth as well).
The level-wise strategy maintains a balanced tree, whereas the leaf-wise strategy splits the leaf that reduces the loss the most.
An illustration demonstrating the difference between level-wise and leaf-wise growth
Level-wise training can be seen as a form of regularized training since leaf-wise training can construct any tree that level-wise training can, whereas the opposite does not hold. Therefore, leaf-wise training is more prone to overfitting but is more flexible. This makes it a better choice for large datasets and is the only option available in lightGBM.
Incidentally, this is a reason previous users of decision tree models should be careful when tuning the num_leaves/max_depth parameters. Compared to the case of level-wise growth, a tree grown with leaf-wise growth will be deeper when the number of leaves is the same. This means that the same max_depth parameter can result in trees with vastly different levels of complexity depending on the growth strategy.
Previously, leaf-wise growth was an exclusive feature of lightGBM, but xgboost has since implemented this growth strategy (this change has not been reflected in the lightGBM docs, but has been acknowledged in a blog post). This strategy is only available for the histogram-based method (which I will explain below), so in order to use it, users will have to set the tree_method parameter to hist and the grow_policy parameter to lossguide.
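For concreteness, here is a minimal sketch of what that configuration might look like with the scikit-learn wrappers of both libraries; the numeric values are only illustrative:

```python
import xgboost as xgb
import lightgbm as lgb

# xgboost: leaf-wise (loss-guided) growth is only available with the
# histogram-based tree method.
xgb_leafwise = xgb.XGBRegressor(
    tree_method="hist",
    grow_policy="lossguide",
    max_leaves=31,   # caps the number of leaves when growing leaf-wise
    max_depth=0,     # 0 lifts the depth limit so max_leaves is the binding constraint
)

# lightGBM always grows leaf-wise; num_leaves is its main complexity control.
lgb_leafwise = lgb.LGBMRegressor(
    num_leaves=31,
    max_depth=-1,    # -1 means no depth limit
)
```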
3. Finding the Best Split
The key challenge in training a GBDT is the process of finding the best split for each leaf. When naively done, this step requires the algorithm to go through every feature of every data point. The computational complexity is thus O(#data × #features). Modern datasets tend to be large in both the number of samples and the number of features. For instance, a tf-idf matrix of a million documents with a vocabulary size of 1 million would have a trillion entries. Thus, a naive GBDT would take forever to train on such datasets.
There is no method that can find the best split while avoiding going through all features of all data points. Therefore, the various methods that xgboost and lightGBM present are methods of finding the approximate best split.
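To see where this cost comes from, here is a deliberately naive sketch of exhaustive split search, reusing the gradient-statistics gain from earlier; it loops over every feature and every observed value of that feature:

```python
import numpy as np

def naive_best_split(X, grad, hess, lam=1.0):
    """Exhaustively evaluate every (feature, threshold) pair.

    This requires on the order of #samples * #features split evaluations
    (and even more work per evaluation as written), which is exactly what
    the approximate methods below try to avoid.
    """
    def score(g, h):
        return g * g / (h + lam)

    parent = score(grad.sum(), hess.sum())
    best = (None, None, -np.inf)        # (feature index, threshold, gain)
    for j in range(X.shape[1]):                  # every feature
        for threshold in np.unique(X[:, j]):     # every observed value
            left = X[:, j] <= threshold
            gain = (score(grad[left].sum(), hess[left].sum())
                    + score(grad[~left].sum(), hess[~left].sum())
                    - parent)
            if gain > best[2]:
                best = (j, threshold, gain)
    return best

# Toy usage with random gradient statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
print(naive_best_split(X, rng.normal(size=200), np.ones(200)))
```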
3.1 Histogram-based methods (xgboost and lightGBM)
The amount of time it takes to build a tree is proportional to the number of splits that have to be evaluated. Often, small changes in the split don’t make much of a difference in the performance of the tree. Histogram-based methods take advantage of this fact by grouping feature values into a set of bins and performing the splitting on the bins instead of the raw values. This is equivalent to subsampling the number of splits that the model evaluates. Since the features can be binned before building each tree, this method can greatly speed up training, reducing the computational complexity to O(#bins × #features).
An example of how binning can reduce the number of splits to explore. The features must be sorted in advance for this method to be effective.
Though conceptually simple, histogram-based methods present several choices that the user has to make. Firstly, the number of bins creates a trade-off between speed and accuracy: the more bins there are, the more accurate the algorithm is, but the slower it is as well. Secondly, how to divide the features into discrete bins is a non-trivial problem: dividing the bins into equal intervals (the simplest method) can often result in an unbalanced allocation of data. Though the details are beyond the scope of this post, the “most balanced” method of dividing the bins actually depends on the gradient statistics. Xgboost offers the option tree_method=approx, which computes a new set of bins at each split using the gradient statistics. LightGBM and xgboost with the tree_method parameter set to hist will both compute the bins at the beginning of training and reuse the same bins throughout the entire training process.
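In terms of parameters, both libraries expose the bin count as max_bin; the values below are, to my knowledge, the library defaults and are shown only for illustration:

```python
import xgboost as xgb
import lightgbm as lgb

# Fewer bins -> faster but coarser splits; more bins -> slower but finer splits.
xgb_hist = xgb.XGBRegressor(tree_method="hist", max_bin=256)

# tree_method="approx" re-computes the candidate bins during training using
# the gradient statistics, as described above.
xgb_approx = xgb.XGBRegressor(tree_method="approx")

lgb_model = lgb.LGBMRegressor(max_bin=255)  # lightGBM bins once, before training
```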
3.2 Ignoring sparse inputs (xgboost and lightGBM)
Xgboost and lightGBM tend to be used on tabular data or on text data that has been vectorized. Therefore, their inputs tend to be sparse. Since the vast majority of the values will be 0, having to look through all the values of a sparse feature is wasteful. Xgboost proposes to ignore the zero entries when computing a split, and then allocate all the data points with missing or zero values to whichever side of the split reduces the loss more. This reduces the number of samples that have to be examined when evaluating each split, speeding up the training process.
Incidentally, xgboost and lightGBM both treat missing values in the same way that xgboost treats the zero values in sparse matrices: they ignore them during split finding, then allocate them to whichever side reduces the loss the most.
Though lightGBM does not enable ignoring zero values by default, it has an option called zero_as_missing which, if set to True, will regard all zero values as missing. According to this thread on GitHub, lightGBM will treat missing values in the same way as xgboost as long as the parameter use_missing is set to True (which is the default behavior).
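Below is a small sketch of how those options can be passed; the sparse matrix here is just randomly generated placeholder data:

```python
import numpy as np
import scipy.sparse as sp
import lightgbm as lgb

# A sparse feature matrix: the vast majority of entries are zero.
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).normal(size=1000)

# use_missing=True (the default) enables the missing-value handling described
# above; zero_as_missing=True additionally treats zero values as missing.
model = lgb.LGBMRegressor(use_missing=True, zero_as_missing=True)
model.fit(X, y)
```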
3.3 Subsampling the data: Gradient-based One-Side Sampling (lightGBM)
This is a method that is employed exclusively in lightGBM. The essential observation behind this method is that not all data points contribute equally to training: data points with small gradients tend to already be well trained (close to a local minimum). This means that it is more efficient to concentrate on the data points with larger gradients.
The most straightforward way to use this observation is to simply ignore data points with small gradients when computing the best split. However, this risks biasing the sample and changing the distribution of the data. For instance, if data points belonging to the “young” age group tended to be less well trained, the sampled data would have a much younger age distribution, and the resulting split point would likely be skewed toward younger ages than is optimal.
In order to mitigate this problem, lightGBM also randomly samples from data with small gradients. This results in a sample that is still biased towards data with large gradients, so lightGBM increases the weight of the samples with small gradients when computing their contribution to the change in loss (this is a form of importance sampling, a technique for efficient sampling from an arbitrary distribution).
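In lightGBM, this sampling scheme is selected through the boosting parameter, and the fractions of retained samples are controlled by top_rate and other_rate. A minimal sketch (the values shown are, to my knowledge, the library defaults):

```python
import lightgbm as lgb

# GOSS keeps the top_rate fraction of samples with the largest gradients,
# randomly samples an other_rate fraction of the rest, and up-weights the
# randomly sampled points to compensate for the biased sampling.
model = lgb.LGBMRegressor(
    boosting_type="goss",
    top_rate=0.2,    # fraction of large-gradient samples that are always kept
    other_rate=0.1,  # fraction of the remaining samples drawn at random
)
```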
3.4 Exclusive Feature Bundling (lightGBM)
This is a method introduced in lightGBM that also takes advantage of the sparsity of large datasets. The essential observation behind this method is that the sparsity of features means that some features are never non-zero together. For instance, the words “Python” and “protein” might never appear in the same document in the data. This means that these features can be “bundled” into a single feature without losing any information. Suppose the tf-idf score for “Python” ranges from 0 to 10 and the tf-idf score for “protein” ranges from 0 to 20. In this case, the bundled feature (which takes the value of the “Python” score when it is non-zero, and the “protein” score plus an offset of 10 otherwise) would range from 0 to 30, and can be converted back to the original tf-idf scores.
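To make the bundling concrete, here is a small numpy sketch of merging two mutually exclusive features with an offset. This is only my own illustration of the idea: lightGBM performs the bundling internally (controlled by its enable_bundle parameter, which defaults to True).

```python
import numpy as np

# Two sparse features that are never non-zero at the same time.
python_tfidf  = np.array([0.0, 3.2, 0.0, 9.5, 0.0])   # values in [0, 10]
protein_tfidf = np.array([4.1, 0.0, 0.0, 0.0, 17.8])  # values in [0, 20]

# Bundle: keep "Python" values as-is and shift "protein" values by an offset
# of 10 so that the two value ranges do not collide; the bundle spans [0, 30].
OFFSET = 10.0
bundle = np.where(python_tfidf != 0, python_tfidf,
                  np.where(protein_tfidf != 0, protein_tfidf + OFFSET, 0.0))

# The original values can be recovered from the bundled feature.
recovered_python  = np.where(bundle <= 10, bundle, 0.0)
recovered_protein = np.where(bundle > 10, bundle - OFFSET, 0.0)
assert np.allclose(recovered_python, python_tfidf)
assert np.allclose(recovered_protein, protein_tfidf)
```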
Unfortunately, the problem of finding the most efficient bundling is NP-hard. Therefore, the authors of the paper opted for an approximate algorithm that tolerates a certain degree of overlap between the non-zero elements within a feature bundle. The details of this algorithm are beyond the scope of this post, so please refer to the original paper for details.
4. Conclusion and Further Readings
Xgboost and lightGBM are very powerful and effective algorithms that can be used out of the box without understanding their internals (that is, ironically, one of the reasons for their success). Despite this, knowing the internals can be of great assistance when tuning or using the algorithms in practice. With the above explanations in mind, you should be able to better understand what the various hyperparameters in xgboost and lightGBM mean and how they will affect the training.
I didn’t cover actual code for using these algorithms or details of the algorithms, so for those who want to learn more, I direct them to the following resources.
Xgboost
A blog post on how to use xgboost
LightGBM
A blog post by Microsoft on lightGBM
Both
A Kaggle kernel comparing xgboost and lightGBM