# Decision tree algorithm theory

2020-11-09 13:43:59

WeChat official account: Manon charging station pro

#### 1. What is a decision tree?

A decision tree is a machine-learning algorithm that we can use to handle classification problems. The decision (classification) process of a decision tree can be represented as an inverted tree structure, which is why it is called a decision tree.

For example, suppose we decide whether to go play football based on the weather: whether it is sunny and whether it is windy. Only when it is sunny and not windy do we go play.

This decision process can be represented as a tree structure, as follows: it is the simplest possible decision tree, and we can use it to decide whether to go play football. At the top is the root node of the tree, and at the bottom are its leaf nodes. The boxes hold the judgment conditions, and the ovals hold the decision results.

Of course, in practice the conditions are not this simple, and we will not draw the tree by hand. In real applications, a program builds the decision tree automatically from a set of sample data; this automatic construction of the tree is the machine-learning process.

The final decision tree is the result of machine learning and is called a model. Finally, we can feed some attribute values into the model and let it give us a prediction.

#### 2. How do we build a decision tree?

For example, suppose we have the following data set:

| Serial number | Condition: is it sunny? | Condition: is it windy? | Result: play football? |
|---|---|---|---|
| 1 | yes | no | go |
| 2 | yes | yes | don't go |
| 3 | no | yes | don't go |
| 4 | no | no | don't go |

You can see that the table has 4 rows of data (not counting the header row) and 4 columns.

In machine learning, the last column is generally called the target, and the condition columns are called the features.

We want to use this data set to build a decision tree. So how do we build it?

First, we need to decide which attribute to use as the first criterion: do we check whether it is sunny first, or whether it is windy? In other words, should the weather be the root node of the tree, or the wind?

To answer this question we need the concepts of information entropy and information purity. Let's first look at what information entropy is.

#### 3. What is information entropy?

In 1948, Claude Shannon introduced information entropy (usually denoted `H`) in his paper "A Mathematical Theory of Communication"; the unit of information entropy is the bit.

In other words, the amount of information can be quantified. The amount of information in a message is related to its uncertainty; one can say that the amount of information equals the amount of uncertainty.

The formula for information entropy is as follows:

`H(X) = -∑ p(xᵢ) · log p(xᵢ)`, where the sum runs over `i = 1, …, n`

The meaning of the formula:

• The things to be classified fall into several categories; `n` is the number of categories.
• `H(X)` is the entropy; mathematically, it is the expected information content over all categories.
• `-log p(xᵢ)` is the information value of category `i`, and `p(xᵢ)` is the probability of that category.
• The `log` in the formula is usually taken with base 2.

In short, all you need to know is that the amount of information can be computed with a mathematical formula; in information theory this quantity is called information entropy. The greater the information entropy, the more information (uncertainty) there is.

##### 3.1. Calculating information entropy

Let's calculate the information entropy of the table above. We only care about the "result" column:

| Result: play football? |
|---|
| go |
| don't go |
| don't go |
| don't go |

From the table we can see that there are 2 categories in total, "go" and "don't go": "go" appears 1 time and "don't go" appears 3 times.

Calculate the probabilities of "go" and "don't go" separately:

• `P(go) = 1 / 4 = 0.25`
• `P(don't go) = 3 / 4 = 0.75`

Then, according to the entropy formula, calculate the term for each of "go" and "don't go", with `log` taken base 2:

• `H(go) = 0.25 * log 0.25 = -0.5`
• `H(don't go) = 0.75 * log 0.75 = -0.31127812445913283`

Therefore, the amount of information contained in the whole table is:

• `H(table) = -(H(go) + H(don't go)) = 0.81127812445913283`

##### 3.2. Implementing the entropy calculation in code

Implemented in `Python` code, the entropy calculation looks like this:

```python
import math

# Calculate the information entropy of a data set.
# data_set is a list; each element is itself a list (one row),
# and the last element of each row is the class label.
def calc_ent(data_set):
    labels = {}  # count of each label

    for data in data_set:
        label = data[-1]  # only the last element (the target) is used
        if label not in labels:
            labels[label] = 0
        labels[label] += 1

    ent = 0.0          # entropy
    n = len(data_set)  # number of rows

    # compute the entropy according to the formula
    for label in labels:
        prob = float(labels[label]) / n  # probability of this label
        ent -= prob * math.log(prob, 2)

    return ent
```

Use this function to calculate the information entropy of the table:

```python
# Convert the table to a Python list.
# "yes" means "go", "no" means "don't go".
data_set = [['yes'], ['no'], ['no'], ['no']]
ent = calc_ent(data_set)
print(ent)  # 0.811278124459...
```

As you can see, the code prints 0.811278124459…, matching our manual result of 0.81127812445913283 (the difference is only in the number of decimal places shown).

#### 4. What is information purity?

Information purity is inversely related to information entropy:

• The greater the entropy, the more information, the more disordered the data, and the lower the purity.
• The smaller the entropy, the less information, the more regular the data, and the higher the purity.

There are three classic "impurity" algorithms:

• Information gain, i.e. the `ID3` algorithm (Iterative Dichotomiser 3), proposed by `Ross Quinlan` in 1975; it can generate binary or multiway trees.
  • ID3 selects the attribute with the largest information gain as the splitting attribute.
• Information gain ratio, i.e. the `C4.5` algorithm, `Ross Quinlan`'s improvement on `ID3`; it can also generate binary or multiway trees.
• Gini impurity, i.e. the `CART` algorithm (`Classification and Regression Trees`).
  • It can be used for classification trees as well as regression trees. Classification trees split on the Gini coefficient; regression trees split on deviation.
  • The Gini coefficient itself reflects the uncertainty of the samples.
  • The smaller the Gini coefficient, the smaller the differences between the samples and the lower the uncertainty.
  • The `CART` algorithm selects the attribute with the smallest Gini coefficient as the splitting attribute.
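As an aside, the Gini impurity of a single column of class labels is simply `1 - Σ pᵢ²`. Here is a minimal sketch of that calculation (for illustration only, not a full CART implementation; `calc_gini` is a name made up here), applied to the "result" column of our table:

```python
# Gini impurity: 1 - sum of squared class probabilities.
# Minimal sketch for illustration; not a full CART implementation.

def calc_gini(labels):
    """Compute the Gini impurity of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# The "result" column of our table: 1 "go" ("yes"), 3 "don't go" ("no").
print(calc_gini(['yes', 'no', 'no', 'no']))  # 1 - (0.25^2 + 0.75^2) = 0.375
```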

Information gain is the simplest of the three; the latter two are derived from it. In this article we will only cover information gain in detail.

The Gini coefficient is also a common indicator in economics for measuring a country's income gap. A Gini coefficient above `0.4` indicates a large wealth gap; a value between `0.2` and `0.4` indicates a reasonable distribution without a large gap.

#### 5. What is information gain?

Information gain is the change in the amount of information before and after partitioning a data set according to some attribute.

The formula for information gain is as follows:

`G(D, a) = H(D) - ∑ᵥ (|Dᵥ| / |D|) · H(Dᵥ)`, where `Dᵥ` is the subset of `D` in which attribute `a` takes its `v`-th value

The meaning of the formula:

• In short: `G = H(parent node) - H(all child nodes)`
• That is, the information entropy of the parent node minus the combined information entropy of all the child nodes.
• The entropy of each child node is weighted by the proportion of the parent's rows that fall into it; the weighted sum is called the normalized information entropy.

The point of information gain is to measure the improvement in purity after partitioning the data set, i.e. the decrease in information entropy. If partitioning the data set by some attribute yields the largest information gain, then that attribute is the best choice.

Therefore, to find the root node, we calculate the information gain of each attribute as a candidate root, and the attribute with the largest information gain becomes the root node.
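This selection procedure can be sketched in code. The sketch below reuses the `calc_ent` function from section 3.2 and adds two illustrative helpers, `split_data_set` and `choose_best_feature` (both names invented here, not part of any library):

```python
import math

def calc_ent(data_set):
    """Information entropy of the last column (the target) of data_set."""
    labels = {}
    for data in data_set:
        label = data[-1]
        labels[label] = labels.get(label, 0) + 1
    n = len(data_set)
    ent = 0.0
    for label in labels:
        prob = labels[label] / n
        ent -= prob * math.log(prob, 2)
    return ent

def split_data_set(data_set, index, value):
    """Rows whose feature at `index` equals `value`, with that feature removed."""
    return [row[:index] + row[index + 1:] for row in data_set if row[index] == value]

def choose_best_feature(data_set):
    """Return the index of the feature with the largest information gain."""
    base_ent = calc_ent(data_set)
    num_features = len(data_set[0]) - 1  # last column is the target
    best_gain, best_index = 0.0, -1
    for i in range(num_features):
        values = set(row[i] for row in data_set)
        new_ent = 0.0
        for value in values:
            subset = split_data_set(data_set, i, value)
            prob = len(subset) / len(data_set)
            new_ent += prob * calc_ent(subset)  # normalized child entropy
        gain = base_ent - new_ent  # G = H(parent) - H(children)
        if gain > best_gain:
            best_gain, best_index = gain, i
    return best_index

# Columns: is it sunny?, is it windy?, play football?
data_set = [['yes', 'no', 'yes'],
            ['yes', 'yes', 'no'],
            ['no', 'yes', 'no'],
            ['no', 'no', 'no']]
print(choose_best_feature(data_set))  # 0
```

Note that on this tiny data set both attributes happen to yield the same information gain (about 0.3113), so the tie is broken in favor of the first attribute, "is it sunny?".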

##### 5.1. Calculating the information gain

For convenience, here is the table again:

| Serial number | Condition: is it sunny? | Condition: is it windy? | Result: play football? |
|---|---|---|---|
| 1 | yes | no | go |
| 2 | yes | yes | don't go |
| 3 | no | yes | don't go |
| 4 | no | no | don't go |

We already know that the information gain equals the difference in information entropy before and after partitioning by an attribute.

We have already calculated the entropy of this table before partitioning:

• `H(table) = 0.81127812445913283`

Next, we calculate the information gain of partitioning by "is it sunny?". The partition produces two tables.

Table 1, where "is it sunny?" is "yes":

| Serial number | Condition: is it sunny? | Condition: is it windy? | Result: play football? |
|---|---|---|---|
| 1 | yes | no | go |
| 2 | yes | yes | don't go |

There are 2 categories, "go" and "don't go": "go" appears 1 time and "don't go" appears 1 time.

Therefore, "go" and "don't go" both occur with probability 0.5:

• `P(go) = P(don't go) = 1 / 2 = 0.5`

Then the entropy terms for "go" and "don't go", with `log` taken base 2:

• `H(go) = H(don't go) = 0.5 * log 0.5 = -0.5`

Therefore, the amount of information contained in table 1 is:

• `H(table 1) = -(H(go) + H(don't go)) = 1`

Table 2, where "is it sunny?" is "no":

| Serial number | Condition: is it sunny? | Condition: is it windy? | Result: play football? |
|---|---|---|---|
| 3 | no | yes | don't go |
| 4 | no | no | don't go |

There is only 1 category, "don't go". Therefore:

• `P(don't go) = 1`

Then the entropy term for "don't go", with `log` taken base 2:

• `H(don't go) = 1 * log 1 = 0`

Therefore, the amount of information contained in table 2 is:

• `H(table 2) = 0`

The data set has 4 rows in total:

• Table 1 holds 2 of them, so its weight is 2/4 = 0.5
• Table 2 holds 2 of them, so its weight is 2/4 = 0.5

Therefore, the information gain of partitioning by "is it sunny?" is:

• `G(sunny) = H(table) - (0.5*H(table 1) + 0.5*H(table 2)) = H(table) - 0.5 = 0.31127812445913283`

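The hand calculation above can be checked with a few lines of Python that transcribe the arithmetic directly:

```python
import math

# Entropy of the full table: P(go) = 1/4, P(don't go) = 3/4.
h_table = -(0.25 * math.log(0.25, 2) + 0.75 * math.log(0.75, 2))

# Table 1 (sunny = yes): one "go", one "don't go" -> entropy 1 bit.
h_table1 = -(0.5 * math.log(0.5, 2) + 0.5 * math.log(0.5, 2))

# Table 2 (sunny = no): all "don't go" -> entropy 0.
h_table2 = 0.0

# Each sub-table holds 2 of the 4 rows, so both weights are 0.5.
gain = h_table - (0.5 * h_table1 + 0.5 * h_table2)
print(gain)  # ≈ 0.3113
```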
##### 5.2. Disadvantages of the ID3 algorithm

If you compute the information gain for more attributes, you will find that the ID3 algorithm tends to select attributes with many distinct values as (root) nodes.

But sometimes such an attribute has little or no effect on the result, and then the attribute chosen by ID3 is not an appropriate split.

The C4.5 algorithm was created to remedy this shortcoming of ID3.

C4.5 improves on ID3 in the following ways:

• It uses the information gain ratio instead of the information gain, avoiding ID3's bias toward attributes with many values.
• It adds pruning, preventing the overfitting that ID3 is prone to.
• It discretizes continuous attributes, so C4.5 can handle continuous attributes, while ID3 can only handle discrete data.
• It handles missing values, so C4.5 can also work with incomplete data sets.

Of course, C4.5 is not without shortcomings: because it needs to scan the data set multiple times, it is relatively inefficient. We will not discuss the C4.5 algorithm further here.
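For reference only, the information gain ratio mentioned above divides the gain by the "split information", i.e. the entropy of the partition sizes themselves, which penalizes attributes that split the data into many small subsets. A minimal sketch of the formula, with `gain_ratio` as an illustrative name:

```python
import math

def gain_ratio(gain, subset_sizes):
    """Information gain ratio: gain divided by the split information.

    subset_sizes: the number of rows in each child subset after the split.
    """
    n = sum(subset_sizes)
    split_info = -sum((s / n) * math.log(s / n, 2) for s in subset_sizes if s)
    return gain / split_info if split_info else 0.0

# Splitting our 4-row table on "is it sunny?" gives two 2-row sub-tables,
# so the split information is 1 bit and the ratio equals the gain itself.
print(gain_ratio(0.31127812445913283, [2, 2]))
```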

The next article will cover how to use a decision tree to solve a practical problem.

Follow the author's official account for more technical content.