
Gradient Boosted Decision Trees (GBDT) Explained: A Classification Example


In December 2006, at the IEEE International Conference on Data Mining (ICDM), experts at the conference selected the top ten data mining algorithms of the time (see reference 【1】). AdaBoost, an important representative of ensemble learning, made the list. Among the algorithms designed around the Boosting idea, however, the one most often compared with AdaBoost is Gradient Boosting. It is one of the traditional machine learning algorithms that best fits the true distribution of the data, and in the years before deep learning took off, Gradient Boosting swept nearly every data mining and knowledge discovery competition.

 

The Gradient Boosted Decision Tree (GBDT), also known as Gradient Tree Boosting, is an ensemble learning technique that uses decision trees as base learners and combines them into an additive model within the Boosting framework. In a previous article 【2】 we walked through how to use GBDT for regression; readers are encouraged to read that article first as background for this one.

 

We already know that GBDT can be used for regression as well as for classification. This article focuses on how to use GBDT for classification. As an example, the training set we will use is shown in the table below: each viewer's age, whether they like popcorn, and their favorite color make up the feature vector, and whether the viewer likes science fiction movies serves as the classification label.

Just as when we used GBDT for regression, we begin by building a single-node decision tree as the initial estimate, or prediction, for every individual. In the regression case this step amounted to taking the average of the targets; for the classification task at hand, the value used as the initial estimate is log(odds), the same quantity used in logistic regression. For details you can refer to my earlier articles 【3】 and 【4】.

How do we turn this value into a classification prediction? Recall what we do in logistic regression: we convert it into a probability. Specifically, the Logistic function maps a log(odds) value to a probability. Therefore, given a viewer, the probability that he likes science fiction movies is

$$p = \frac{e^{\log(odds)}}{1 + e^{\log(odds)}} \approx 0.7$$

Note that, to keep the calculations simple, we retain only one digit after the decimal point. The fact that log(odds) and the probability of liking science fiction both come out to 0.7 is merely a coincidence of the rounding; there is no direct connection between the two values.
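As a quick check, here is a minimal Python sketch of these two steps. The ordering of the label vector y is hypothetical; only the counts, four viewers who like science fiction and two who do not, come from the example.

```python
import numpy as np

# Labels of the six viewers in the toy training set: 1 = likes sci-fi, 0 = does not.
y = np.array([1, 1, 0, 1, 1, 0])

# Initial estimate: the log of the odds of the positive class.
log_odds = np.log(y.sum() / (len(y) - y.sum()))        # log(4/2) ≈ 0.69, i.e. about 0.7

# Convert the log(odds) into a probability with the Logistic (sigmoid) function.
prob = np.exp(log_odds) / (1 + np.exp(log_odds))       # ≈ 0.67, i.e. about 0.7

print(round(log_odds, 1), round(prob, 1))              # 0.7 0.7
```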

 

Now we know that, given a viewer, the estimated probability that he likes science fiction movies is 0.7. Since this probability is greater than 50%, we conclude that he likes science fiction movies. How good is this initial guess? At the moment it is a very rough one: on the training set the prediction is correct for four people and wrong for the other two. In other words, the current model does not perform particularly well even on the training data. We can use pseudo residuals to measure how far the initial estimate is from the truth. Just as in the earlier regression analysis, the pseudo residual here is the difference between the observed value and the predicted value.

As shown in the figure above, the two red dots represent the two viewers in the data set who do not like science fiction, i.e. the probability that they like science fiction is 0. Likewise, the four blue dots represent the four viewers who do like science fiction, i.e. the probability that they like it is 1. The red and blue dots are the actual observations, while the gray dashed line is our prediction of 0.7. It is therefore easy to compute the pseudo residual of each data point: 1 − 0.7 = 0.3 for each viewer who likes science fiction, and 0 − 0.7 = −0.7 for each viewer who does not, as listed in the table below:
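Continuing the sketch above, the pseudo residuals are just the observed labels minus the current predicted probability:

```python
import numpy as np

# Pseudo residual = observed value (0 or 1) minus the current predicted probability.
residuals = y - prob
print(np.round(residuals, 1))    # [ 0.3  0.3 -0.7  0.3  0.3 -0.7]
```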

The next step is the same as before: build a decision tree on the feature vectors to predict the pseudo residuals given in the table above. This yields the decision tree shown below. Note that we need to limit the number of leaves the tree is allowed to have (regression has a similar restriction). In this example we set the upper limit to 3, since the data set is tiny; in practice, with large data sets, the maximum number of leaves is usually set somewhere between 8 and 32. When using the scikit-learn toolbox, you can control this value with the max_leaf_nodes parameter when instantiating the class sklearn.ensemble.GradientBoostingClassifier.
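Internally, each boosting round fits exactly this kind of regression tree to the pseudo residuals. The sketch below mirrors that step with scikit-learn's DecisionTreeRegressor; the feature values in X are hypothetical placeholders, since the article's table is not reproduced here.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical encoding of the feature table (age, likes popcorn, favorite color).
X = pd.DataFrame({
    "age":           [12, 87, 44, 19, 32, 14],
    "likes_popcorn": [1, 1, 0, 0, 1, 1],
    "color_green":   [0, 1, 0, 1, 1, 0],   # one-hot style encoding of favorite color
})

# Fit a regression tree to the pseudo residuals, capping the number of leaves at 3.
tree = DecisionTreeRegressor(max_leaf_nodes=3)
tree.fit(X, residuals)
```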

How to use this newly built decision tree is a little more involved. Notice that the initial estimate log(odds) ≈ 0.7 lives on the log-odds scale, while the pseudo residuals stored in the leaves of the new tree are derived from probabilities, so the two cannot simply be added together. Some transformation is therefore essential. Specifically, when using GBDT for classification, the value of each leaf node is further processed with the following transformation.
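In the notation of this article, the transformation takes the pseudo residuals that fall into a leaf, together with the probabilities predicted for those same samples in the previous round, and computes the leaf's output as

$$\text{Leaf output} = \frac{\displaystyle\sum_{i \,\in\, \text{leaf}} \text{Residual}_i}{\displaystyle\sum_{i \,\in\, \text{leaf}} \text{Previous Probability}_i \times \left(1 - \text{Previous Probability}_i\right)}$$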

The derivation of this transformation involves some mathematical details, which we will leave for a later article. For now we simply apply the formula. For example, for the first leaf node, which contains only the single value −0.7, the summation signs can be ignored because there is only one term, giving

$$\frac{-0.7}{0.7 \times (1 - 0.7)} \approx -3.3$$

Note that before this decision tree was built, our previous model, the single-node tree, assigned every viewer a probability of 0.7 of liking science fiction, so 0.7 is the value substituted for Previous Probability in the formula above. The value in this leaf node is therefore replaced with −3.3.

 

Next, consider the leaf node containing the two values 0.3 and −0.7. We have

$$\frac{0.3 + (-0.7)}{0.7 \times (1 - 0.7) + 0.7 \times (1 - 0.7)} = \frac{-0.4}{0.42} \approx -1.0$$

Note that because this leaf node contains two pseudo residuals, the sum in the denominator has one term for each of them. Also, at this point Previous Probability is the same for every pseudo residual (because the previous decision tree had only a single node); that will no longer be true when we build the next decision tree.
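A small helper makes the transformation easy to check against the numbers above; leaf_output is an illustrative name, not part of any library.

```python
def leaf_output(residuals_in_leaf, prev_probs):
    """Leaf value: sum of the pseudo residuals divided by the sum of p * (1 - p)."""
    return sum(residuals_in_leaf) / sum(p * (1 - p) for p in prev_probs)

# Leaf holding the single pseudo residual -0.7 (previous probability 0.7):
print(round(leaf_output([-0.7], [0.7]), 1))            # -3.3
# Leaf holding the two pseudo residuals 0.3 and -0.7:
print(round(leaf_output([0.3, -0.7], [0.7, 0.7]), 1))  # -1.0
```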

 

Similarly, the last leaf node, which holds the remaining three pseudo residuals of 0.3, works out to

$$\frac{0.3 + 0.3 + 0.3}{3 \times 0.7 \times (1 - 0.7)} = \frac{0.9}{0.63} \approx 1.4$$

The decision tree now looks like this:

Now, using the previous decision tree together with the new one we have just obtained, we can predict whether a viewer likes science fiction movies. As in the earlier regression case, we scale the new decision tree by a learning rate. Here we use the value 0.8, which is deliberately large for the convenience of the demonstration. When using the scikit-learn toolbox, you can control the learning rate with the learning_rate parameter when instantiating the class sklearn.ensemble.GradientBoostingClassifier; its default value is 0.1.

For example, we are going to calculate the first audience in the table above log(odds), Available 0.7 + (0.8×1.4) = 1.8. therefore , utilize Logistic The function calculates the probability

Notice that our previous estimate of the probability that the first viewer likes science fiction was 0.7, and it is now 0.9, so the model has clearly taken a small step in the right direction. In the same way we compute, one by one, the probability that each of the remaining viewers likes science fiction, then recompute the pseudo residuals, giving the results in the table below.
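Continuing the sketch, the update for the first viewer uses the numbers from the text: leaf output 1.4 and learning rate 0.8.

```python
import numpy as np

learning_rate = 0.8

# First viewer: previous log(odds) 0.7, falls into the leaf whose output is 1.4.
new_log_odds = 0.7 + learning_rate * 1.4                      # 1.82, about 1.8
new_prob = np.exp(new_log_odds) / (1 + np.exp(new_log_odds))  # about 0.86, i.e. 0.9 after rounding
new_residual = 1 - new_prob                                   # this viewer's observed label is 1
```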

Next, a decision tree is again built on the feature vectors to predict the pseudo residuals given in the table above, which produces the decision tree below.

Then the value of each leaf node is further processed with the transformation used earlier, giving the final output of each leaf node.

For example, let us compute the output of the leaf node corresponding to the second row of the table shown above:

$$\frac{0.5}{0.5 \times (1 - 0.5)} = 2$$

where Previous Probability takes the value 0.5, the predicted probability from the table above.

Now let us compute a slightly more complicated leaf. As shown in the figure above, the predictions of four viewers fall into this leaf node, so the numerator sums their four pseudo residuals and the denominator sums the four corresponding Previous Probability × (1 − Previous Probability) terms.

The output of this leaf node is therefore 0.6. This step also shows that the denominator terms for different pseudo residuals need not be identical, because the previous predicted probabilities differ from viewer to viewer. After the outputs of all leaf nodes have been computed, the new decision tree looks like this:

Now we can put together everything we have obtained so far, as shown in the figure below. We started with a tree consisting of a single node, built a second decision tree on top of it, and now have a third. Each newly generated decision tree is scaled by the learning rate and then added to the initial single-node tree.

Next, new pseudo residuals can be computed from all of the decision trees obtained so far. Note that the first set of pseudo residuals was computed from the initial estimate alone; the second set was computed from the initial estimate together with the first decision tree; and the third set will be computed from the initial estimate together with the first and second decision trees. Moreover, each time a new decision tree is added, the pseudo residuals shrink, and shrinking pseudo residuals mean the model is steadily moving in the right direction.
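The whole procedure described above can be condensed into a short training loop. The sketch below is an illustration for the binary log-loss case, not a full reimplementation; fit_gbdt_binary is an illustrative name, and the leaf values are recomputed with the transformation introduced earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_binary(X, y, n_trees=10, learning_rate=0.8, max_leaf_nodes=3):
    """Sketch of the boosting loop described above (binary classification, log-loss)."""
    y = np.asarray(y, dtype=float)
    base_log_odds = np.log(y.sum() / (len(y) - y.sum()))   # initial single-node estimate
    F = np.full(len(y), base_log_odds)                     # current log(odds) of each sample
    trees = []
    for _ in range(n_trees):
        p = 1 / (1 + np.exp(-F))                           # current predicted probabilities
        residuals = y - p                                  # pseudo residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residuals)
        # Recompute each leaf's output: sum of residuals / sum of p * (1 - p) in that leaf.
        leaf_of = tree.apply(X)
        leaf_values = {
            leaf: residuals[leaf_of == leaf].sum()
                  / (p[leaf_of == leaf] * (1 - p[leaf_of == leaf])).sum()
            for leaf in np.unique(leaf_of)
        }
        # Add the new tree's (scaled) contribution to every sample's log(odds).
        F += learning_rate * np.array([leaf_values[leaf] for leaf in leaf_of])
        trees.append((tree, leaf_values))
    return base_log_odds, trees
```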

 

To obtain better results, we repeat this process of computing pseudo residuals and building a new decision tree, until the change in the pseudo residuals is no longer significant or the preset maximum number of decision trees has been reached. When using the scikit-learn toolbox, you can control the maximum number of trees with the n_estimators parameter when instantiating the class GradientBoostingClassifier; its default value is 100.
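In practice, all of the above is handled by scikit-learn's GradientBoostingClassifier. A minimal usage sketch, assuming X and y are the (hypothetical) encoded feature table and labels from the earlier snippets:

```python
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    n_estimators=100,    # upper limit on the number of decision trees (default 100)
    learning_rate=0.1,   # shrinkage applied to each new tree (default 0.1)
    max_leaf_nodes=8,    # cap on leaves per tree; 8 to 32 is a common range in practice
)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])   # predicted probability that each viewer likes sci-fi
```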

 

For demonstration purposes, suppose now that the final GBDT we obtain consists of only the three decision trees discussed above (the initial single-node tree and the two subsequent decision trees). Given a viewer who is 25 years old, likes popcorn, and likes the color green, does he like science fiction movies? We run this feature vector through the three decision trees, take the leaf values, scale them by the learning rate, and add them up: $0.7 + (0.8 \times 1.4) + (0.8 \times 0.6) = 2.3$. Note that this is a log(odds) value, so we convert it into a probability with the Logistic function, obtaining about 0.9. We therefore conclude that this viewer likes science fiction movies. That is how a trained GBDT is used to make classification predictions.
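The same scoring step in code, reusing the leaf values quoted in the text:

```python
import numpy as np

# Initial log(odds) plus the learning-rate-scaled outputs of the two trees' leaves.
log_odds = 0.7 + 0.8 * 1.4 + 0.8 * 0.6              # = 2.3
prob = np.exp(log_odds) / (1 + np.exp(log_odds))    # about 0.91, so we predict "likes sci-fi"
```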

 

In a subsequent article we will also explain the mathematical principles behind Gradient Boosting in detail; that should give readers a deeper understanding of why the algorithm is designed the way it is.

 


References

 

* The example in this article is mainly based on, and adapted from, reference 【5】; reference 【6】 provides a simple example of using GBDT for classification in scikit-learn.

【1】Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y. and Zhou, Z.H., 2008. Top 10 algorithms in data mining. Knowledge and information systems, 14(1), pp.1-37.

【2】 Gradient Boosted Decision Trees (GBDT) Explained: Regression

【3】 Blog post links

【4】 Blog post links

【5】Gradient Boost for Classification

【6】https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting

 
