# A popular understanding of xgboost

2020-12-07 12:51:23

## xgboost, the easy-to-understand Kaggle competition killer

Note: if some of the images in this post do not display properly, please refer to the version of this article in the xgboost question bank.
Date: March 25, 2019.

### 0 Preface

xgboost has acquired a near-legendary reputation in competitions: every so often in a Kaggle or Tianchi contest, someone beats thousands of competing teams to take the championship with xgboost.

Our machine learning course also covers xgboost. As instructor Han put it: "RF and GBDT are the models industry loves; xgboost is a heavyweight weapon. For a while the various top spots on Kaggle leaderboards showed xgboost dominating the field, and the winning margin of first place in a certain Didi competition also owed a great deal to xgboost."

Besides, since the first half of 2016, July Online has been organizing students to take part in various competitions so that they can grow through real practice (after all, you cannot do AI without hands-on work, and a competition walks you through data processing, feature selection, model tuning, and parameter search in code, which makes it a great practice opportunity and a big help for improving your skills and for finding or changing jobs).

Under the AI wave, many friends, especially those transitioning this year from traditional IT jobs, ask how to move into AI. I usually emphasize the "four pillars" of learning AI or job-hunting in AI: courses + question bank + OJ + Kaggle/Tianchi. The graduation assessment of our training camp likewise incorporates Kaggle or Tianchi competitions.

Given the importance of Kaggle and Tianchi competitions in data science, this article introduces xgboost to help you get started quickly and achieve good results in competitions.

Finally, xgboost was not invented by me, July, but I will make sure this article is the most accessible introduction to it (this article was also reviewed by Dr. Chen, head of the July Online AI Lab). Thanks as well to all the references listed at the end of this post. If you spot any problem, feel free to leave a comment. Thanks.

### 1 Decision tree

Here is an example. One session of our training camp had 100-odd students. Suppose you are given a task: count how many are boys and how many are girls. As the students come up to you one by one, how do you tell the men from the women?

Quickly, you reason that boys usually have short hair and girls usually have long hair, so you divide the whole class into two groups by hair length: long hair is "female", short hair is "male".

In effect you have split the entire class on a single indicator, "hair length", forming a simple decision tree whose split criterion is hair length.
At this point, some people may object: why split on "hair length"? Couldn't we split on "wearing high heels or not", or "has an Adam's apple or not"? Of course we could.

But which indicator is better? The direct answer is: whichever splits the data better should be preferred. So we need an evaluation standard that quantifies how good a split is.

How do we judge whether "hair length" or "has an Adam's apple" gives the better split, and how do we quantify the effect? Intuitively, when we divide people by some criterion, the purer each group, the better. For instance, if the "female" group turns out to be all women and the "male" group all men, the split is as good as it gets. Reality is rarely that ideal, so the closer we come to that situation, the better we consider the split.

There are many ways to quantify the quality of a split, such as information gain (ID3), information gain ratio (C4.5), and the Gini index (CART).

The measure behind information gain: entropy

The core idea of the ID3 algorithm is to use information gain as the criterion for attribute selection: at each split, choose the attribute whose split yields the largest information gain.

What is information gain? To define it precisely, we first define a metric widely used in information theory, called entropy, which characterizes the purity of an arbitrary sample set. Given a set S of examples containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is:

$$\mathrm{Entropy}(S) = -p_+ \log_2 p_+ \; - \; p_- \log_2 p_-$$

In the formula above, $p_+$ is the proportion of positive examples (for instance, in the second example at the beginning of the original article, a positive example means "plays badminton") and $p_-$ is the proportion of negative examples ("does not play"). In all entropy calculations we define $0 \log 0 = 0$.

For example, suppose S is a set of 14 examples of a Boolean concept, including 9 positive and 5 negative examples (we summarize such a sample with the notation [9+, 5-]). Then the entropy of S relative to this Boolean classification is:

Entropy([9+, 5-]) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940.

So, according to the formula above, we get:

• If all members of S belong to the same class, then Entropy(S) = 0;
• If S contains equal numbers of positive and negative examples, then Entropy(S) = 1;
• If the numbers of positive and negative examples differ, the entropy lies between 0 and 1

As shown in the figure below:

As you can see, the entropy value lets you evaluate how well the current classification tree splits the data.
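The entropy calculation above is easy to check in code. Here is a minimal sketch (the function name `entropy` is my own, not from any library):

```python
import math

def entropy(pos, neg):
    """Entropy of a sample set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                       # by convention, 0 * log(0) = 0
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))   # the [9+, 5-] example: 0.94
print(entropy(7, 7))             # equal positives and negatives: 1.0
print(entropy(14, 0))            # all one class: 0.0
```

The three prints reproduce the three bullet points above: maximal entropy at a 50/50 split, zero entropy for a pure set.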

For more details such as pruning, overfitting, and the pros and cons of each criterion, see the article "Decision tree learning".

So now the soul of the decision tree is in place: split the tree on some indicator to achieve classification or regression, always aiming for the purest possible groups.

### 2. Regression trees and ensemble learning

If you had to define xgboost in one sentence, it is simple: xgboost is an ensemble of many CART trees. But what is a CART tree?

Decision trees used in data mining and machine learning come in two main types:

1. Classification tree analysis, where the predicted result is the class the data belongs to (e.g., watch a given movie or not)
2. Regression tree analysis, where the predicted result can be considered a real number (e.g., the price of a house, or the length of a hospital stay)

The term classification and regression tree (CART, Classification And Regression Tree) analysis is an umbrella term for both kinds of trees, first proposed by Breiman et al.

2.1 Regression trees
In fact, classification and regression are two closely related problems. Classification aims to decide, based on certain features of known samples, which of the known classes a new sample belongs to; the result is a discrete value. Regression produces a continuous value. Essentially they are the same thing: a mapping from features to an outcome/label.

Once classification and regression are sorted out, classification trees and regression trees are not hard to understand.

The sample output (response value) of a classification tree is a class: for example, whether a life-saving drug is genuine or fake, or whether to go see the movie "Wind spell" this weekend or not. The sample output of a regression tree is a numerical value: for example, the loan amount granted for a house is a specific number, which could be anything from 0 to 3 million yuan.

So for a regression tree, you cannot use the classification tree's information gain, information gain ratio, or Gini index to decide whether a node should split; you need a new way to evaluate quality, namely prediction error (commonly mean squared error, log error, and so on). And a node no longer holds a class but a numerical (predicted) value. How is that value determined? Sometimes it is the mean of the samples within the node; sometimes it is optimized, as in xgboost.

A CART regression tree is a binary tree built by repeatedly splitting on features. For example, the current node splits on the j-th feature: samples whose feature value is less than s go to the left subtree, those greater than s go to the right subtree.

In essence, a CART regression tree partitions the sample space along the feature dimensions, and optimizing such a space partition is an NP-hard problem, so decision tree models solve it heuristically. A typical CART regression tree minimizes the squared error within each region, which leads to the objective:

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 \;+\; \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$

where $R_1(j,s) = \{x \mid x^{(j)} \le s\}$ and $R_2(j,s) = \{x \mid x^{(j)} > s\}$, and the inner minima are attained at the region means. So when we look for the optimal splitting feature j and the optimal split point s, we are solving exactly this objective: by traversing every split point of every feature, we find the best split, and eventually obtain a regression tree.
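To make the "traverse all split points" step concrete, here is a toy sketch (pure Python, all names hypothetical) that finds the single split s of one feature minimizing the total squared error, with each side predicting its mean:

```python
def best_sse_split(xs, ys):
    """Exhaustively search the split s minimizing SSE; left: x <= s, right: x > s."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)          # optimal leaf value is the mean
        return sum((v - m) ** 2 for v in vals)
    pairs = sorted(zip(xs, ys))
    best_err, best_s = float('inf'), None
    for i in range(1, len(pairs)):
        s = pairs[i - 1][0]                # candidate split between sample i-1 and i
        left = [y for x, y in pairs if x <= s]
        right = [y for x, y in pairs if x > s]
        err = sse(left) + sse(right)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err

# two obvious clusters: the best split separates them with zero error
print(best_sse_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 10, 10, 10]))  # (3, 0.0)
```

A real CART implementation applies this search to every feature at every node and recurses; the sketch only shows the inner loop.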

2.2 Boosting and ensemble learning

Ensemble learning means building multiple (weak) classifiers to predict on a data set, then combining their predictions by some strategy into the final prediction. The popular analogy is "three cobblers with their wits combined equal Zhuge Liang", or a company's board of directors voting on decisions. It requires each weak classifier to have a certain accuracy, and the classifiers to be diverse.

Depending on whether the weak classifiers depend on each other, ensemble learning splits into two schools, boosting and bagging:

1. The boosting school: the classifiers depend on each other and must be trained serially, e.g., AdaBoost, GBDT (Gradient Boosting Decision Tree), xgboost
2. The bagging school: the classifiers are independent of each other and can be trained in parallel, e.g., random forest (Random Forest)

The famous AdaBoost, one of the most representative methods of the boosting school, has been described in detail in this blog.

AdaBoost is short for "Adaptive Boosting", proposed by Yoav Freund and Robert Schapire in 1995. Its adaptivity lies in this: the weights of samples misclassified by the previous basic classifiers are increased, and the re-weighted samples are used to train the next basic classifier. Meanwhile, a new weak classifier is added in each round until a predetermined small error rate or a predetermined maximum number of iterations is reached.

Concretely, the whole AdaBoost iteration has 3 steps:

1. Initialize the weight distribution of the training data. If there are N samples, each training sample initially receives the same weight: 1/N.
2. Train a weak classifier. During training, if a sample has been classified accurately, its weight is decreased when constructing the next training set; conversely, if a sample has been misclassified, its weight is increased. The sample set with updated weights is then used to train the next classifier, and the whole training process iterates this way.
3. Combine the trained weak classifiers into a strong classifier. After training, the weight of a weak classifier with a small classification error rate is increased so that it plays a larger role in the final classification function, and the weight of a weak classifier with a high error rate is decreased so that it plays a smaller role. In other words, weak classifiers with low error rates carry larger weights in the final classifier, and vice versa.
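The three steps above can be sketched with decision stumps as the weak classifiers. This is a minimal toy implementation of the classic algorithm, not production code, and every name in it is my own:

```python
import math

def stump_predict(x, dim, thresh, sign):
    """A decision stump: predict `sign` if x[dim] > thresh, else -sign."""
    return sign if x[dim] > thresh else -sign

def train_stump(X, y, w):
    """Exhaustively pick (dim, thresh, sign) minimizing the weighted error."""
    best = None
    for dim in range(len(X[0])):
        for v in sorted({row[dim] for row in X}):
            for sign in (1, -1):
                thresh = v - 0.5
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, dim, thresh, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, dim, thresh, sign)
    return best

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                            # step 1: uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, dim, thresh, sign = train_stump(X, y, w)
        err = max(err, 1e-12)                    # avoid log of zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)  # step 3: classifier weight
        ensemble.append((alpha, dim, thresh, sign))
        # step 2: raise the weights of misclassified samples, lower the rest
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, dim, thresh, sign))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, dim, thresh, sign)
                for alpha, dim, thresh, sign in ensemble)
    return 1 if score >= 0 else -1

# toy data: one feature, class -1 below 1.5 and +1 above
X, y = [[0.0], [1.0], [2.0], [3.0]], [-1, -1, 1, 1]
ens = adaboost(X, y, rounds=3)
print([predict(ens, x) for x in X])   # [-1, -1, 1, 1]
```

Note how steps 1 to 3 of the text map directly onto the commented lines: uniform initialization, re-weighting of misclassified samples, and error-rate-based classifier weights.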

The other boosting method, GBDT (Gradient Boosting Decision Tree), differs from AdaBoost: each GBDT iteration aims to reduce the residual of the previous round, and it builds the new model in the direction in which the residual decreases (the negative gradient).

Boosting-style ensemble learning consists of multiple correlated decision trees. What does correlated mean? An example:

1. Suppose a sample [data -> label] is [(2, 4, 5) -> 4]
2. The first decision tree, trained with this sample, predicts 3.3
3. Then, as input for training the second decision tree, this sample becomes [(2, 4, 5) -> 0.7]
4. In other words, the input sample for the next decision tree depends on the training and prediction of the previous one

You will soon see why xgboost is also this kind of boosting ensemble learning.

And the keys to building a regression tree are:

• What criterion the split point divides on (as mentioned above, e.g., the smallest mean squared error loss);
• What predicted value a node takes after the split (as said before, one option is the mean of the samples under the leaf node; alternatively it can be computed by optimization, as in xgboost)

As for the other family of ensemble methods, such as the Random Forest algorithm: each decision tree is independent, randomly draws a batch of samples from the sample pool and a random subset of the features, and trains on its own; the trees have no relationship to one another. This article will not cover it.

### 3.GBDT

To talk about xgboost, we have to start from GBDT (Gradient Boosting Decision Tree), because xgboost is still essentially a GBDT, but one that pushes speed and efficiency to the extreme, hence the X (Extreme) in XGBoost. And, as said before, both are boosting methods.

The principle of GBDT is simple: the sum of the outputs of all the weak classifiers is the predicted value, and the next weak classifier is then fitted to the gradient of the error function at the current prediction, i.e., the residual (the gap between the predicted value and the true value). And of course, the weak classifiers here are trees. As shown in the figure: Y = Y1 + Y2 + Y3.

Take a very simple example. Say I am 30 years old this year, but the computer, or the model GBDT, does not know my age. What does GBDT do?

1. In the first weak classifier (the first tree), it fits my age with an arbitrary guess, say 20, and finds the error is 10 years;
2. In the second tree it fits the remaining gap with 6, and finds there are still 4 years left;
3. In the third tree it fits the remaining gap with 3, leaving a gap of only 1 year;
4. Finally, in the fourth tree it fits the remaining residual with 1. Perfect.

In the end, the conclusions of the four trees add up to the true age, 30. In actual engineering, GBDT computes the negative gradient and uses it to approximate the residual.
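The story above can be written as a loop. Here is a sketch under toy assumptions: each "tree" is just a single number, and a shrinkage factor `lr` (my own addition) stands in for the fact that a real tree only partially fits the residual:

```python
def boost_to_target(y, n_rounds, lr=0.5):
    """Each round fits lr * residual, where residual = y - current prediction
    (the negative gradient of the squared loss 0.5 * (y - pred)**2)."""
    pred = 0.0
    for _ in range(n_rounds):
        residual = y - pred        # what the next "tree" should fit
        pred += lr * residual      # add the new tree's output
    return pred

print(boost_to_target(30, 1))     # 15.0: the first rough guess
print(boost_to_target(30, 10))    # close to 30: residuals shrink every round
```

The point of the sketch is the loop shape: compute the residual of the current ensemble, fit the next learner to it, add it in, and repeat.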

Note: why can GBDT use the negative gradient to approximate the residual?

In a regression task, GBDT has a predicted value for each sample in every iteration. Take the loss function to be the mean squared error loss:

$$l(y_i, \hat{y}_i) = \frac{1}{2}\left(y_i - \hat{y}_i\right)^2$$

Then the negative gradient is computed as:

$$-\left[\frac{\partial\, l(y_i, \hat{y}_i)}{\partial\, \hat{y}_i}\right] = y_i - \hat{y}_i$$

So when the loss function is the mean squared error, the value fitted in each round is (true value - value predicted by the current model), i.e., the residual. The variable here is $\hat{y}_i$, "the value predicted by the current model", which is what we take the negative gradient with respect to.

A further word here. In the age-prediction story above, the word "arbitrary" in the first step may look casual, but on reflection it is anything but: you will find that most prediction models basically follow this routine. Start with a rough guess, compare the gap between the prediction and the truth, and keep adjusting to narrow the gap. Hence a model comes with an objective function (setting the goal) and a loss function (narrowing the error).

Think further and you will find this matches perfectly how humans predict things: when you do not know something well, you start from experience and keep refining until, in some sense, you approach or exactly hit the truth.

Let's continue with an age-prediction example.

For simplicity, suppose the training set contains only 4 people, A, B, C, and D, whose ages are 14, 16, 24, 26. A and B are a first-year and a third-year high school student; C and D are a fresh graduate and an employee with two years of experience.

So now the problem is to predict the ages of these 4 people. How do we start? Simple: start by fitting them all with one age, say 20, then adjust according to reality.

If we train with a traditional regression decision tree, we get the result shown in the figure below:

Now let's do it with GBDT. Since there is so little data, we limit each tree to two leaf nodes, i.e., each tree has only one split, and we limit the ensemble to two trees.

We get the result shown in the figure below:

The first tree splits the same way as in Figure 1: since A and B are close in age, and C and D are close in age, they are divided into two groups, each group using its average age as the prediction.

• Now compute the residuals (residual means: A's true value - A's predicted value = A's residual), so A's residual = true value 14 - predicted value 15 = -1.
• Note that A's predicted value is the sum over all preceding trees; there is only one tree so far, so it is simply 15. If there were more trees, their outputs would be summed to get A's predicted value.

In mathematical statistics, a residual is the difference between the actual observation and the estimate (fitted value). Residuals carry important information about the model's underlying assumptions: if the regression model is correct, we may view the residuals as observations of the errors.

We then get the residuals of A, B, C, D: -1, 1, -1, 1 respectively.

Next, replace the original values of A, B, C, D with their residuals -1, 1, -1, 1, and train the second tree on them. The second tree sees only the two values 1 and -1, which split directly into two nodes: A and C on the left, B and D on the right. After computing (e.g., for A: actual -1 - predicted -1 = residual 0; likewise for C: actual -1 - predicted -1 = 0), everyone's residual is now 0.

With all residuals at 0, the second tree's predictions equal their actual (residual) values, so adding the second tree's conclusion to the first tree's yields the true ages: everyone ends up with a perfect prediction.

In other words, the predictions for A, B, C, D now all match their real ages. Perfect!

A: 14-year-old first-year high schooler, shops little, often asks seniors questions; predicted age A = 15 - 1 = 14
B: 16-year-old third-year high schooler, shops little, is often asked questions by juniors; predicted age B = 15 + 1 = 16
C: 24-year-old fresh graduate, shops a lot, often asks senior colleagues questions; predicted age C = 25 - 1 = 24
D: 26-year-old employee of two years, shops a lot, is often asked questions by junior colleagues; predicted age D = 25 + 1 = 26

So GBDT sums the scores of multiple trees to get the final predicted score, and in each iteration it adds, on top of the existing trees, one more tree that fits the residual between the previous prediction and the true value.
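The whole A/B/C/D walkthrough fits in a few lines of code. A sketch assuming the two splits from the figures (the group memberships are hard-coded since the figures chose the splits for us; all names are my own):

```python
def leaf_means(groups, target):
    """One 'tree': every sample in a group predicts the group's mean target."""
    pred = {}
    for group in groups:
        mean = sum(target[p] for p in group) / len(group)
        for p in group:
            pred[p] = mean
    return pred

ages = {'A': 14, 'B': 16, 'C': 24, 'D': 26}

# tree 1 splits on purchase amount: {A, B} vs {C, D}, predicting 15 and 25
tree1 = leaf_means([('A', 'B'), ('C', 'D')], ages)
residual = {p: ages[p] - tree1[p] for p in ages}      # A:-1, B:1, C:-1, D:1

# tree 2 splits on asks-vs-answers questions: {A, C} vs {B, D}, fit on residuals
tree2 = leaf_means([('A', 'C'), ('B', 'D')], residual)

final = {p: tree1[p] + tree2[p] for p in ages}
print(final)   # {'A': 14.0, 'B': 16.0, 'C': 24.0, 'D': 26.0}
```

The final prediction is the sum of the two trees' leaf values, exactly as in the text: 15 - 1 = 14 for A, 25 + 1 = 26 for D, and so on.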

### 4.Xgboost

4.1 How xgboost defines a tree

The figures in this section are mostly taken from the slides (PPT) of xgboost's original author, Tianqi Chen.

For example, suppose we want to predict how much each member of a family likes video games. Considering that, compared with the old, the young are more likely to like video games, and men more than women, we first separate children from adults by age, then separate males from females by gender, and finally assign each person a score for how much they like video games, as shown in the figure below.

In this way we train 2 trees, tree1 and tree2. Just as in the GBDT principle before, the sum of the two trees' conclusions is the final conclusion, so the child's predicted score is the sum of the scores of the leaf nodes he falls into in the two trees: 2 + 0.9 = 2.9. Grandpa's predicted score works the same way: -1 + (-0.9) = -1.9. Details are shown in the figure below.

Okay, at this point you may be about to slam the table and exclaim: isn't this just the GBDT introduced above, the same song with a different tune?

In fact, if we set aside some differences in engineering implementation and optimization, the biggest difference between xgboost and GBDT is the definition of the objective function. xgboost's objective function is shown in the figure below:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}$$

where:

• The term the red arrow points to, L, is the loss function (for example the square loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$, or the logistic loss $l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i)\ln(1 + e^{\hat{y}_i})$)
• The red box is the regularization term (including L1 and L2 regularization)
• The red circle is the constant term
• For the loss term, xgboost keeps the Taylor expansion up to three terms (i.e., second order) as an approximation

We can see clearly that the final objective function depends only on the first and second derivatives of the error function at each data point.

Er, after all that winding path, suddenly throwing out such a big formula may leave many people momentarily confused. Don't worry: below we take this objective function apart and analyze it piece by piece, every formula, every symbol, every subscript.

4.2 The xgboost objective function

The core algorithmic idea of xgboost is not difficult. Basically:

1. Keep adding trees and keep splitting on features to grow them. Adding one tree at a time actually means learning a new function $f_t(x)$ to fit the residual of the previous prediction, so that the model's prediction is

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

Note: $w_{q(x)}$ is the score of leaf node $q(x)$, $\mathcal{F}$ is the space corresponding to the set of all K regression trees, and $f(x)$ is one of those regression trees.

2. When training has produced the K trees, to predict the score of a sample we drop it, according to its features, into one corresponding leaf node in each tree; each leaf node corresponds to a score
3. Finally we just add up the scores of the corresponding leaf in every tree to get the sample's predicted value.

Obviously, our goal is to make the ensemble's predicted value $\hat{y}_i$ as close as possible to the true value $y_i$, while generalizing as well as possible.

So, from a mathematical point of view this is a functional optimization problem, and the objective function can be written as:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

As you can see, this objective function has two parts: the loss function and the regularization term. The loss function reveals the training error (the gap between the predicted score and the true score), and the regularization term defines the model complexity.

In the formula above, $\hat{y}_i$ is the output of the whole additive model, and the regularization term $\sum_k \Omega(f_k)$ is a function of the complexity of the trees: the smaller its value, the lower the complexity and the stronger the generalization ability. Its expression is

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Here T is the number of leaf nodes and w the leaf scores. Intuitively, the objective asks the prediction error to be as small as possible, the number of leaves T to be as few as possible (γ controls the number of leaf nodes), and the node scores w to avoid extremes (λ keeps the leaf scores from being too large), all to prevent overfitting.
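As a quick numeric sketch of the complexity formula (the function name is my own), the complexity of one tree given its leaf scores:

```python
def omega(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * (sum of squared leaf scores)."""
    T = len(leaf_scores)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_scores)

# a three-leaf tree with scores 2, 0.9 and -1 (as in the video-game example),
# with gamma = lambda = 1: complexity = 3 + 0.5 * 5.81 = 5.905
print(omega([2, 0.9, -1], gamma=1.0, lam=1.0))
```

Raising `gamma` penalizes trees with many leaves; raising `lam` penalizes large leaf scores.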

As an aside, a general objective function contains the following two terms:

$$Obj(\Theta) = L(\Theta) + \Omega(\Theta)$$

where the error/loss term encourages the model to fit the training data as well as possible, so that the final model has small bias; and the regularization term encourages simpler models, because when the model is simple, the results fitted from finite data have less randomness and are less prone to overfitting, making the final predictions more stable.

4.2.1 Model learning and training error

Concretely, in the first part of the objective function, $\sum_{i} l(y_i, \hat{y}_i)$, the index i denotes the i-th sample, and $l(y_i, \hat{y}_i)$ (driven by the gap $\hat{y}_i - y_i$) measures the prediction error of the i-th sample. Our goal, of course, is for the error to be as small as possible.

Following the earlier GBDT routine, xgboost also sums the scores of multiple trees to get the final predicted score (in each iteration, on top of the existing trees, it adds one tree that fits the residual between the previous prediction and the true value).

But how do we choose the function $f_t$ to add in each round? The answer is direct: pick the $f_t$ that makes our objective function decrease as much as possible.

To elaborate: the model's prediction at round t is the prediction of the first t-1 rounds plus the new tree, i.e.

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

so the objective function at round t can be written as

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}$$

where the last terms are the regularization term and a constant. In this error function, at step t, $y_i$ is the true value and hence known, and $\hat{y}_i^{(t-1)}$ can be computed from step t-1 by accumulating the previous trees, so it too is known in a sense; therefore what the model learns is only $f_t(x_i)$.

The Obj above may be too abstract, so consider the special case where l is the squared error (i.e., $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$). Then the objective can be written as the following quadratic function of $f_t$ (the circled part represents the residual between the predicted value and the true value):

$$Obj^{(t)} = \sum_{i=1}^{n} \left[ 2\left(\hat{y}_i^{(t-1)} - y_i\right) f_t(x_i) + f_t(x_i)^2 \right] + \Omega(f_t) + \mathrm{constant}$$

More generally, what if the loss function is not quadratic? Use the Taylor expansion: approximate the non-quadratic loss by a quadratic (as you will see below, with the first derivative denoted $g_i$ and the second derivative $h_i$).

Now, pay attention! Many people get stuck here, and few articles online explain it clearly. After discussing it with Dr. Chen of the July Online AI Lab, the key turns out to be making the correspondence between the terms of the second-order Taylor expansion and the terms of the xgboost objective function explicit, so that the objective can be approximated by the second-order Taylor expansion.

First, here is the second-order Taylor expansion:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$

Corresponding to the xgboost objective function, ignore $y_i$ inside the loss $l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i))$ (don't forget the earlier remark: "at step t, $y_i$ is the true value and hence known", so it does not affect the partial derivatives of the objective with respect to $\hat{y}_i^{(t-1)}$), and make the one-to-one correspondence:

• x in the Taylor expansion of f corresponds to $\hat{y}_i^{(t-1)}$ in the objective, and $\Delta x$ corresponds to $f_t(x_i)$
• f itself corresponds to the loss l in the objective
• thus differentiating f with respect to x corresponds to taking the partial derivative of the loss with respect to $\hat{y}_i^{(t-1)}$

This yields:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{constant}$$

where

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$$

Phew, thorough at last! But where does the second-order Taylor expansion come from?

In mathematics, Taylor's formula is a formula that describes the values of a function near a point using information about the function at that point. It comes from Taylor's theorem in calculus: for a sufficiently smooth differentiable function, given the values of its derivatives at a point, Taylor's formula uses those derivatives as coefficients to build a polynomial, called the Taylor polynomial, that approximates the function's values in a neighborhood of that point.

In effect it tells us that we can approximate the original function using some number of terms of its Taylor polynomial.

Taylor's theorem:
Let n be a positive integer. If the function f, defined on an interval containing a, is n+1 times differentiable at the point a, then for any x in this interval:

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}\,(x - a)^k + R_n(x)$$

The polynomial part is called the Taylor expansion of the function at a, and the remainder $R_n(x)$ is the remainder of Taylor's formula, a higher-order infinitesimal of $(x - a)^n$.
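You can watch the second-order approximation at work numerically. A small sketch checking it against $f(x) = e^x$, which is its own derivative at every order:

```python
import math

def taylor2(f, df, d2f, x, dx):
    """Second-order Taylor approximation of f(x + dx) around x."""
    return f(x) + df(x) * dx + 0.5 * d2f(x) * dx * dx

x, dx = 0.0, 0.1
approx = taylor2(math.exp, math.exp, math.exp, x, dx)   # 1 + 0.1 + 0.005 = 1.105
exact = math.exp(x + dx)                                 # 1.10517...
print(approx, exact)   # the quadratic is already accurate to about 2e-4
```

This is exactly the approximation xgboost makes: replace the true loss near the current prediction by a quadratic built from its first and second derivatives.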

Next, since our t-th regression tree is derived from the residual of the first t-1 trees, the value of the first t-1 trees, $\hat{y}_i^{(t-1)}$, is known. In other words, $l(y_i, \hat{y}_i^{(t-1)})$ does not affect the optimization of the objective and can be removed directly; the constant term can also be dropped. This gives the cleaner objective:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

At this point the objective function depends only on the first derivative $g_i$ and second derivative $h_i$ of the error function at each data point (you can already see one difference between xgboost and GBDT here: the objective keeps the second-order term of the Taylor expansion).

As a general guiding principle, reader crab6789, now at Google, put it this way:

In essence, every assignment of samples to leaf nodes corresponds to an obj value, and the optimization process is obj optimization. Splitting nodes into different leaf combinations gives different obj values, and all the optimization revolves around this idea.

So far we have discussed the first part of the objective function, the training error. Next we discuss the second part, regularization, i.e., how to define the complexity of a tree.

4.2.2 Regularization: the complexity of a tree

First, a few conventions:

• A tree is represented by its set of leaf nodes and their scores: $f_t(x) = w_{q(x)}$, with $w \in \mathbb{R}^T$
• Each sample falls into exactly one leaf node
• q(x) gives the leaf node that sample x lands in, and $w_{q(x)}$ is that node's score, i.e., the model's predicted value for the sample

So once we break a tree into its structure part q and its leaf-weight part w, the structure function q maps an input to a leaf index, and w gives the leaf score for each index.

In addition, as shown in the figure below, xgboost defines tree complexity with two parts:

1. the number of leaf nodes in the tree, T
2. the squared L2 norm of the tree's leaf scores w (applying L2 regularization to w is equivalent to L2 smoothing of each leaf's score, with the aim of avoiding overfitting)

Under this new definition, we can rewrite the earlier objective function as follows:

$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T}\left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$$

where $I_j = \{\, i \mid q(x_i) = j \,\}$ is the set of indices of the samples assigned to leaf node j, g is the first derivative, and h is the second derivative. This step works because xgboost's regularization contributes two terms to the objective: one for the number of leaf nodes (T) and one for the leaf scores (w).

Thus the objective contains two kinds of summation:

• one over samples, $\sum_{i=1}^{n}$ (n is the sample size)
• one over leaves, $\sum_{j=1}^{T}$ (T is the number of leaf nodes)

and after grouping the samples by leaf, the objective is a sum of T independent univariate quadratic functions, one per leaf weight $w_j$.

What is the key to understanding this derivation? After discussing with Dr. Chen of the AI Lab, it is understanding the definition of $I_j$, the set of sample indices on leaf j. Embedded in this definition is q(xi): every sample value xi is mapped by the function q to some leaf of the tree, and it is this definition that unifies the two kinds of summation (over samples and over leaves).

Next, we can define:

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

so the formula above simplifies to:

$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$

Setting the derivative with respect to $w_j$ to 0, we get:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

Substituting this optimal solution back in gives:

$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
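The closed-form solution is two lines of arithmetic. A sketch (the function name is my own) that takes each leaf's g and h lists and returns the optimal weights and Obj*:

```python
def optimal_leaves(leaves, lam=1.0, gamma=0.0):
    """leaves: one (g_list, h_list) pair per leaf node.
    Returns w_j* = -G_j / (H_j + lambda) per leaf, and the structure score Obj*."""
    weights, obj = [], gamma * len(leaves)
    for gs, hs in leaves:
        G, H = sum(gs), sum(hs)
        weights.append(-G / (H + lam))       # optimal leaf weight
        obj -= 0.5 * G * G / (H + lam)       # that leaf's contribution to Obj*
    return weights, obj

# one leaf holding two samples with g = 1, h = 1 each, lambda = 1:
# G = 2, H = 2, so w* = -2/3 and Obj* = -0.5 * 4 / 3 = -2/3
w, obj = optimal_leaves([([1, 1], [1, 1])], lam=1.0)
print(w, obj)
```

Given any fixed tree structure, this is all it takes to score it, which is exactly what the splitting algorithm of the next section exploits.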

4.3 Scoring function calculation

Obj represents the largest reduction in our objective we can achieve once we specify the structure of a tree. We can call it the structure score (structure score).

4.3.1 Splitting nodes

One interesting thing: we have covered how xgboost optimizes and computes from start to finish, but we have not yet seen what a tree actually looks like. Clearly, a tree is generated by splitting one node into two, then splitting again and again until it becomes a whole tree. So how to split becomes the key question to discuss next.

As for how to split a leaf node, the xgboost author gives two node-splitting methods in the original paper.

（1） The greedy method of enumerating all different tree structures

We now know that once the structure of the tree is fixed, we can get the best score under that structure; but how do we determine the structure of the tree?

A naive way would be to keep enumerating different tree structures, use the scoring function to find the optimal structure, add it to the model, and repeat. But you soon realize there are far too many structures to enumerate, essentially infinitely many. What then?

Let's try the greedy method instead. Starting from a tree of depth 0, each node traverses all features, such as age, gender and so on. For each feature, first sort the samples by the feature's value, then scan the feature linearly to determine the best split point. Finally, among all features we choose the one with the highest so-called gain (Gain). But how is Gain calculated?

Remember the formula we obtained at the end of section 4.2?

In other words, the G²/(H + λ) part of the objective function represents the contribution of each leaf node to the loss of the current model. Combining the terms for the left child, the right child, and the unsplit node, we obtain the expression for computing Gain:

Gain = ½ [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ

The first thing to notice is "for a feature, first sort the samples by the feature's value". Here's an example.

For example, fix a value a, then enumerate all conditions of the form x < a and a ≤ x (x is a feature such as age; sort age from small to large, so values increase from left to right, samples smaller than a go to the left, and samples larger than a go to the right). For a particular split point a, we compute the sums of the derivatives to the left and to the right of a.

For example, with five people in total, sorted by age, at the start there are the following 4 ways to split them:

1. Divide the first person from the next four
2. Divide the first two from the last three
3. Divide the first three from the last two
4. Divide the front four people from the back one

Next, for each of the above 4 ways of splitting we compute the Gain separately and see which split gives the largest Gain value; that split is the one to choose. After computing, we find that split 2, "the first two people versus the last three", gives the largest Gain value, meaning this split is the most appropriate one at the first level.

In other words, for every feature x, we only need one left-to-right scan to enumerate the gradient sums G_L and G_R of every split, and then compute the Gain to score each split scheme.
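This left-to-right scan can be sketched in a few lines of Python (my own toy code, not the xgboost implementation; it assumes distinct feature values for simplicity, and for squared error g_i = 2·(pred − y_i), h_i = 2):

```python
def best_split(xs, gs, hs, lam=1.0, gamma=0.0):
    """Return (best_gain, threshold) for one feature by a linear scan."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G, H = sum(gs), sum(hs)          # totals for the unsplit node
    GL = HL = 0.0
    best_gain, best_thr = float("-inf"), None
    for k in order[:-1]:             # each prefix is a candidate left side
        GL += gs[k]
        HL += hs[k]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam)
                      + GR * GR / (HR + lam)
                      - G * G / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, xs[k]
    return best_gain, best_thr

# Five people sorted by age; labels y = [1, 1, 1, 5, 5], current pred = 0.
gain, thr = best_split([10, 20, 30, 40, 50],
                       [-2, -2, -2, -10, -10],  # g_i = 2 * (pred - y_i)
                       [2, 2, 2, 2, 2])         # h_i = 2 for squared error
# the best split separates the first three people from the last two (thr = 30)
```

One sorted pass per feature is all it takes; the totals G and H are computed once, so each candidate split costs O(1).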

Then the second, third, fourth, ..., N-th layers continue to be split by the same method.

The second thing to note is that introducing a split does not necessarily make things better, which is why we have the penalty term for introducing new leaves. This part of the objective corresponds to pruning: when the gain brought by a candidate split is less than a threshold γ, we ignore the split.

In other words, when we introduce a split and the result (the score after splitting minus the score without splitting) is too small (for example, smaller than our minimum expected threshold γ), while the complexity goes up, the split is a net loss and it is better not to split. That is, when the benefit of an action is not clearly greater than its cost, we avoid the "more is better" attitude toward complexity and prefer not to split.

Equivalently, once we find that "splitting" is no better than "not splitting" (the gain is too small, smaller than the threshold γ), the two candidate leaf nodes are not created and the node stays a single leaf of the subtree.

The following is the exact greedy algorithm from the paper:

（2） The approximate algorithm

It mainly targets the big-data case, where the exact method cannot be computed directly.

Assigning samples from the root down to the leaf nodes is a combinatorial problem: different assignments have different costs, and finding the best one would require trying them all, which is impossible. That is why the greedy method came about: instead of considering the whole tree at once, it only looks at how the current node is best split. That is the exact greedy method; later, to speed it up, came the histogram-based approach.

4.4 Summary: Boosted Tree Algorithm

To sum up, as shown in the figure:

Let's review the whole process again .

If a sample's label value is 4, the first regression tree might predict 3 and the second 1; another pair of regression trees might predict 2 and 2. Why prefer the latter? In the former case, the first tree has learned too much and gotten too close to 4, which means a greater risk of overfitting.

OK, it sounds nice, but how is it implemented? How does the above objective function relate to the actual parameters? Remember what we said about the parameters of a regression tree:

1. which feature to choose for splitting a node;
2. the predicted value of each node (which can't be obtained in a way as crude and unreasonable as just taking the average; it has to be at least a bit more refined).

The ultimate strategy is: greedy + optimization (right, quadratic optimization).

A popular explanation of the greedy strategy: at decision time, make the decision that is optimal for the current objective; to put it bluntly, the decision that maximizes the immediate benefit, a "short-sighted" strategy.

Here is how the greedy strategy is used. At first you have a pile of samples, all placed on the first node; now T = 1. What is w? We don't know yet, it is what we solve for. At this point the predicted value of every sample is w. Plugging in the sample labels, the loss function becomes

Obj = Σ_{i=1}^{n} l(w − y_i) + γ·1 + ½λw²

• If l(w − y_i) here is the squared error, then the function above is a quadratic in w; finding its minimum, the minimizing point is the predicted value of the node, and the minimum function value is the minimum loss.
• In essence, this is a quadratic optimization problem! But what if the loss function is not quadratic? Taylor expansion: whatever is not quadratic gets approximated as quadratic.
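The "approximate it as quadratic" idea can be checked numerically. A small sketch (my own illustration, not xgboost code): take the non-quadratic logistic loss, expand it to second order around the current prediction, and minimize the quadratic analytically, which gives the Newton step −g/h (regularization omitted here).

```python
import math

def logloss(pred, y):
    """Logistic loss for a raw score pred and a label y in {0, 1}."""
    return math.log(1.0 + math.exp(pred)) - y * pred

def grad_hess(pred, y):
    """First and second derivatives of the logistic loss w.r.t. pred."""
    p = 1.0 / (1.0 + math.exp(-pred))   # sigmoid
    return p - y, p * (1.0 - p)

pred0, y = 0.0, 1.0
g, h = grad_hess(pred0, y)        # g = -0.5, h = 0.25
step = -g / h                     # minimizer of the quadratic approximation
# second-order Taylor value at pred0 + step:
approx = logloss(pred0, y) + g * step + 0.5 * h * step ** 2
exact = logloss(pred0 + step, y)  # the true loss at the same point
```

For such a large unregularized step, approx and exact visibly differ, which is one intuition for why xgboost shrinks steps with λ and a learning rate; near pred0 the quadratic and the true loss agree closely.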

Come on, the next thing to do is to choose a feature and split the node into two, growing a weak sapling. You need to:

1. determine the split feature. How? The simplest way is crude enumeration/exhaustion (yes, crude enough), then choose the split for which the loss function works best;
2. establish each node's w and the smallest loss function. Say it out loud, what do we do? Yes, the extremum of a quadratic function (there is generally a fixed way to compute a quadratic's extremum: the point where the derivative equals 0). So: select a feature to split on and compute the minimum of the loss function; then select another feature to split on and get another loss-function minimum; once the enumeration is done, find the best one, split the tree there, and you get your sapling.

During splitting you can notice that each time a node splits, only the samples on that node affect the loss function; so each time we split, computing the gain of the split (the amount the loss function decreases) only requires looking at the samples of the node we intend to split.

To make a long story short, XGBoost borrows the idea of the CART regression tree, using a greedy algorithm that traverses all split points of all features; the difference lies in the objective function used. Concretely, a split is made when the objective value after splitting exceeds that of the single child leaf by some gain; and, to keep trees from growing too deep, there is also a threshold, and only when the gain is greater than this threshold do we split.

Here the set threshold is γ.

So splitting continues and a tree is formed; then another tree is built, each time taking the optimal further split/tree based on the previous prediction. Isn't that a greedy strategy?

There must be a stopping condition for this kind of cyclic iteration. When does it stop? In short: set a maximum tree depth, and stop growing when the sample-weight sum falls below a set threshold, to prevent overfitting. Specifically:

1. When the gain brought by a split is less than the set threshold, we ignore the split, so not every candidate split is actually made; this is a bit like pre-pruning. The threshold parameter is γ (the coefficient of the leaf-count term T in the regularizer);
2. When the tree reaches its maximum depth, stop building it. The hyperparameter max_depth avoids learning overly local samples because the tree is too deep, i.e. overfitting;
3. When the sum of sample weights is less than a set threshold, tree building stops. What does that mean? It involves a hyperparameter, the minimum sample-weight sum min_child_weight, which is similar to GBM's min_child_leaf parameter but not exactly the same. When a leaf node has too few samples, growth also stops, again guarding against overfitting;
4. There also seems to be a maximum number of trees…
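The first three stopping rules can be seen working together in a toy single-feature tree grower (my own sketch, not the xgboost implementation; the parameter names max_depth, min_child_weight and gamma mirror xgboost's):

```python
def grow(xs, gs, hs, depth, max_depth=3, min_child_weight=1.0,
         lam=1.0, gamma=0.1):
    """Return a nested dict tree; leaves carry w* = -G / (H + lam)."""
    G, H = sum(gs), sum(hs)
    leaf = {"leaf": -G / (H + lam)}
    if depth >= max_depth:                      # stopping rule 2
        return leaf
    # exact greedy scan over this single feature
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    GL = HL = 0.0
    best = None
    for k in order[:-1]:
        GL += gs[k]
        HL += hs[k]
        GR, HR = G - GL, H - HL
        if HL < min_child_weight or HR < min_child_weight:
            continue                            # stopping rule 3
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                      - G * G / (H + lam)) - gamma
        if best is None or gain > best[0]:
            best = (gain, xs[k])
    if best is None or best[0] <= 0:            # stopping rule 1 (gamma)
        return leaf
    thr = best[1]
    L = [i for i in range(len(xs)) if xs[i] <= thr]
    R = [i for i in range(len(xs)) if xs[i] > thr]
    return {"thr": thr,
            "left":  grow([xs[i] for i in L], [gs[i] for i in L],
                          [hs[i] for i in L], depth + 1, max_depth,
                          min_child_weight, lam, gamma),
            "right": grow([xs[i] for i in R], [gs[i] for i in R],
                          [hs[i] for i in R], depth + 1, max_depth,
                          min_child_weight, lam, gamma)}

# Five people again: y = [1, 1, 1, 5, 5], current pred = 0, squared error.
tree = grow([10, 20, 30, 40, 50], [-2, -2, -2, -10, -10],
            [2, 2, 2, 2, 2], depth=0)
```

On this data the root splits at age 30, and both children then stop: every further split has gain below γ, so they remain leaves even though max_depth would allow more levels.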

### Postscript

Finally I have a general understanding of this screen-sweeping xgboost, which confirms something I said before: when you feel you can't learn something, nine times out of ten it's not that you aren't smart enough; nine times out of ten the material you are reading isn't popular or accessible enough (and if that still doesn't work, ask people).

The revision history of this article, for reference while reading:

1. First edition, morning of 8.4: explained gbdt through an easy-to-understand age-prediction example, since gbdt is the basis for understanding xgboost;
2. Second edition, afternoon of 8.4: the xgboost derivation has many formulas that beginners easily get lost in; after grasping xgboost's core, the objective function, reorganized the article's framework around it;
3. Third edition, morning of 8.5: improved the introduction of decision trees, e.g. added an introduction to information gain;
4. Fourth edition, afternoon of 8.5: improved the display of most formulas, e.g. changed plain text to rendered LaTeX images;
5. Fifth edition, morning of 8.6: improved the introduction to boosting and ensemble learning, making the article progress step by step;
6. Sixth edition, evening of 8.6: standardized the formula for the xgboost objective function, and combed through many details and formulas throughout;
7. Seventh edition, morning of 9.1: improved section 4.3.1 on how xgboost splits trees, to make it clearer;
8. Eighth edition, 1.9 of 2019: improved the description of node splitting in section 4.3.1, for clearer logic and more accessible writing;
9. Ninth edition, 1.10 of 2019: added an age-prediction example in part 3, to explain GBDT more plainly;
10. Tenth edition, 1.14 of 2019: spelled out clearly the correspondence between the second-order Taylor expansion and the xgboost objective function, for smoother understanding.

After reading the full text, you will find there are three keys to understanding xgboost: ① in section 4.2.1, a clear understanding of the one-to-one correspondence between the terms of the xgboost objective function and the Taylor-expansion terms; ② the mathematical trick used in section 4.2.2 when computing the score w; ③ the tree-splitting algorithm shown in section 4.3.1. Once these three points are clear, there are no more obstacles to understanding xgboost.

July, evening of August 6, 2018 ~ evening of January 14, 2019.

https://chowdera.com/2020/12/20201207124140818w.html