Three evaluation metrics

**Decision tree algorithms all operate on a given set S: suppose we split on some feature A, choosing the best split value for that feature. The split produces subsets S1, S2, ..., and the proportion of samples falling into each subset is P1, P2, .... With an evaluation metric such as information gain, information gain ratio, or the Gini index, we can then compute the cumulative value over all subsets produced by the split.**

**In this way, every candidate feature gets a cumulative evaluation value for its split, and we select the best-scoring feature as the optimal split for set S.**
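The framework above can be sketched in Python. This is a minimal illustration, not any library's API: subsets are plain lists of class labels, `entropy` is one possible impurity measure, and `split_score` computes the cumulative value over subsets S1, S2, ..., weighted by the proportions P1, P2, ....

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels (one impurity measure)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_score(subsets, impurity=entropy):
    """Cumulative impurity of subsets S1, S2, ... after a split,
    weighted by the proportions P1, P2, ... of samples in each subset."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * impurity(s) for s in subsets)
```

A split that separates the classes perfectly, e.g. `split_score([["yes", "yes"], ["no", "no"]])`, scores 0; the candidate feature whose split scores best under the chosen metric is selected.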

**Information gain (ID3)**


**Split on the attribute with the maximum information gain (the larger the gain, the better the attribute separates the samples and the more representative it is; this is clearly a top-down greedy strategy).**
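Concretely, the ID3 criterion is Gain(S, A) = H(S) − Σv |Sv|/|S| · H(Sv), maximized over attributes. A rough sketch (illustrative code, not a library API; samples are assumed to be dicts mapping attribute names to values, with labels in a parallel list):

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

def id3_select(rows, labels, attrs):
    """Greedy ID3 step: the attribute with maximum information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))
```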

Two disadvantages of ID3:

(1) Selecting attributes by information gain is biased toward attributes with many branches, i.e., attributes with many distinct values.

(2) It cannot handle continuous attributes.


**Information gain ratio (C4.5)**


**Split on the attribute with the maximum information gain ratio (the larger the gain ratio, the better the attribute separates the samples and the more representative it is; again a top-down greedy strategy).**
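In the same illustrative style (rows as dicts; not a library API), the C4.5 criterion divides the information gain by the attribute's split information, i.e., the entropy of the attribute's own value distribution, which is what penalizes many-valued attributes:

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5: GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo
    is the entropy of the attribute's own value distribution."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    gain = entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in groups.values())
    split_info = sum(-(len(ys) / n) * math.log2(len(ys) / n) for ys in groups.values())
    # an attribute with a single value has SplitInfo 0; one common
    # convention is to treat such an attribute as useless
    return gain / split_info if split_info > 0 else 0.0
```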

C4.5 overcomes the two shortcomings of ID3:

(1) The bias of information gain toward attributes with many branches, i.e., many distinct values.

(2) The inability to handle continuous attributes.

Its disadvantage: during tree construction, the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.

The gain ratio uses an attribute's split information to penalize attributes with many values, solving the problem that information gain favors such attributes when selecting. However, it can over-penalize and end up choosing attributes that carry little information (when the information gain is small, the attribute's split information may also be small, which inflates the gain ratio).

**Improvement to the gain ratio:**

**First compute the information gain of every attribute and keep only those whose gain is above average; then compare the gain ratios of just those attributes.**
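This two-stage heuristic can be sketched as follows (illustrative names and data representation, as in the earlier snippets): filter to attributes with at least average information gain, then take the best gain ratio among the survivors.

```python
from collections import Counter, defaultdict
import math

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(rows, labels, attr):
    """Return (information gain, gain ratio) for one attribute."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    gain = entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in groups.values())
    split_info = sum(-(len(ys) / n) * math.log2(len(ys) / n) for ys in groups.values())
    return gain, (gain / split_info if split_info > 0 else 0.0)

def c45_select(rows, labels, attrs):
    """Keep attributes whose gain is at least the average gain,
    then pick the best gain ratio among those candidates."""
    stats = {a: gain_and_ratio(rows, labels, a) for a in attrs}
    mean_gain = sum(g for g, _ in stats.values()) / len(stats)
    candidates = [a for a in attrs if stats[a][0] >= mean_gain]
    return max(candidates, key=lambda a: stats[a][1])
```

For example, an ID-like attribute with one value per sample has maximal gain but a large split information, so a normal attribute with the same gain wins on gain ratio.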


**Gini index (CART)**

**The Gini index measures the impurity of the subset corresponding to one value of an attribute. The smaller the Gini index, the more the samples in that subset are concentrated in a single class; the larger it is, the more mixed the classes are.**

**The corresponding Gini gain**

**When choosing the best attribute to split on, we look for the attribute with the smallest GINI_Gain.**
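A minimal sketch of this computation (illustrative only; real CART trees use binary splits at the best split point, whereas this simplified version branches on every distinct value):

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(S) = 1 - sum_k p_k^2; 0 means the subset is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, labels, attr):
    """Weighted Gini of the subsets produced by splitting on attr."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    return sum(len(ys) / n * gini(ys) for ys in groups.values())

def cart_select(rows, labels, attrs):
    """Pick the attribute whose split has the smallest weighted Gini."""
    return min(attrs, key=lambda a: gini_split(rows, labels, a))
```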

Decision tree learning has two main parts: building the tree, and pruning it to prevent overfitting.

The construction process selects the best attribute in the data set according to some evaluation metric (Gini index, entropy, and so on) and creates one branch per value of that attribute, so the attribute takes a single value on each branch. Each branch thus forms a new data set, on which the optimal attribute selection is repeated recursively, stopping when the data set satisfies some restrictive constraint (or contains only one class), at which point the corresponding class label is attached.
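The recursion just described can be sketched as follows (a simplified illustration: multiway splits on every attribute value, stopping when the node is pure or no attributes remain; `criterion` is any lower-is-better score, here the weighted Gini):

```python
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, labels, attr):
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    return sum(len(ys) / n * gini(ys) for ys in groups.values())

def build_tree(rows, labels, attrs, criterion=gini_split):
    # stop: only one class left -> leaf with that label
    if len(set(labels)) == 1:
        return labels[0]
    # stop: no attributes left -> leaf with the majority label
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # greedy step: attribute with the best (lowest) criterion value
    best = min(attrs, key=lambda a: criterion(rows, labels, a))
    groups = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(y)
    rest = [a for a in attrs if a != best]
    # one branch per value of the chosen attribute, built recursively
    return {best: {v: build_tree(rs, ys, rest, criterion)
                   for v, (rs, ys) in groups.items()}}
```

Internal nodes are represented as nested dicts `{attribute: {value: subtree}}` and leaves as labels, purely for illustration.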

Pruning comes in two flavors: pre-pruning (using restrictive constraints during construction to terminate the tree early) and post-pruning (building the full tree, then cutting off relatively unimportant branches). A few methods:

1) REP (Reduced-Error Pruning)

2) PEP (Pessimistic Error Pruning)

This is the pruning used in the C4.5 algorithm; it is a post-pruning method.

3) CCP (Cost-Complexity Pruning)
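To illustrate the post-pruning idea, here is a rough Reduced-Error-Pruning-style sketch (assumptions: trees are nested dicts `{attribute: {value: subtree}}` with label leaves, and a held-out validation set drives the pruning; this is a simplification for illustration, not the exact published algorithm):

```python
from collections import Counter

def predict(tree, row, default):
    """Walk the nested-dict tree; fall back to `default` on unseen values."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(row[attr], default)
    return tree

def error(tree, rows, labels, default):
    return sum(predict(tree, r, default) != y for r, y in zip(rows, labels))

def rep_prune(tree, rows, labels, default):
    """Bottom-up: replace a subtree with the majority label of the
    validation samples reaching it whenever that does not increase
    validation error."""
    if not isinstance(tree, dict) or not labels:
        return tree
    attr, branches = next(iter(tree.items()))
    # prune children first, each on the validation samples routed to it
    pruned = {}
    for v, sub in branches.items():
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        pruned[v] = rep_prune(sub, [rows[i] for i in idx],
                              [labels[i] for i in idx], default)
    tree = {attr: pruned}
    majority = Counter(labels).most_common(1)[0][0]
    leaf_err = sum(y != majority for y in labels)
    if leaf_err <= error(tree, rows, labels, default):
        return majority  # collapsing does not hurt validation accuracy
    return tree
```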