# Decision Trees: ID3, C4.5, CART

2021-01-23 20:18:04

## Three evaluation criteria

All of these decision tree algorithms work on a given set S. Suppose we split S on some feature A, picking the best partition value for that feature; this yields subsets S1, S2, ..., and the probability of a sample falling into each subset is P1, P2, ... . We can then use an evaluation criterion such as information gain, information gain ratio, or the Gini index to compute the cumulative value of the split over all subsets.

In this way, every candidate feature gets a cumulative evaluation value for its split, and we select the best one as the optimal feature partition of the set S.

## Information gain (ID3)

Split on the attribute with the maximum information gain (the greater the information gain, the better the attribute distinguishes the samples and the more representative it is; this is clearly a top-down greedy strategy). Two shortcomings:

(1) When selecting attributes by information gain, we tend to favor attributes with many branches, i.e. attributes with many distinct values.

(2) It cannot handle continuous attributes.
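The entropy and information-gain computation described above can be sketched as follows; this is a minimal illustration for categorical attributes, and the function names are my own rather than from any library:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy H(S) = -sum(p_k * log2(p_k)) over the class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)) over the values v of A."""
    n = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# An attribute that separates the classes perfectly earns the full entropy
# of the set as its gain.
gain = information_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"])
```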

## Information gain ratio (C4.5)

Split on the attribute with the maximum information gain ratio (the same top-down greedy strategy, but with the gain ratio as the criterion).

C4.5 overcomes two shortcomings of ID3:

(1) The bias toward attributes with many branches, i.e. attributes with many distinct values.

(2) The inability to handle continuous attributes.

Its disadvantage: while building the tree, the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.

The information gain ratio uses an attribute's split information to penalize attributes with many values, which fixes the bias of plain information gain toward such attributes. But it can also over-penalize and end up choosing attributes that carry little information: an attribute whose information gain is small may have split information that is also small, which makes its gain ratio large.

An improvement to the information gain ratio:

First compute the information gain of every attribute and keep only those whose gain is above average; then compare the information gain ratios of just these attributes.
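This two-step heuristic can be sketched as below. The block is self-contained (entropy and information gain are re-defined), and note that split information is simply the entropy of the attribute's own value distribution; `select_attribute` and its argument layout are illustrative, not a standard API:

```python
from collections import Counter
import math

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def information_gain(values, labels):
    n = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def gain_ratio(values, labels):
    # Split information is the entropy of the attribute's value distribution;
    # it grows with the number of branches, penalizing many-valued attributes.
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0

def select_attribute(attrs, labels):
    """C4.5-style heuristic: keep attributes with at-least-average information
    gain, then pick the best gain ratio among them. `attrs` maps attribute
    name -> list of values, aligned with `labels`."""
    gains = {a: information_gain(v, labels) for a, v in attrs.items()}
    avg = sum(gains.values()) / len(gains)
    shortlist = [a for a in attrs if gains[a] >= avg]
    return max(shortlist, key=lambda a: gain_ratio(attrs[a], labels))
```

For example, an id-like attribute with four unique values and a two-valued attribute can have the same information gain, but the split information of the id attribute is larger, so its gain ratio is smaller and the two-valued attribute wins.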

## Gini index (CART)

The Gini index measures the impurity of a set of samples (for a split, the impurity of the subset induced by one value of an attribute). The smaller the Gini index, the more the samples in the subset are concentrated in a single class; the larger it is, the more mixed the classes are.

The corresponding Gini gain:

When choosing the best attribute to split on, we look for the attribute with the smallest GINI_Gain (the weighted Gini index of the subsets after the split).
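A minimal sketch of the Gini index and the weighted post-split value described above (CART proper builds binary splits; this illustration splits on every value of a categorical attribute for simplicity, and the names are my own):

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum(p_k^2): 0 for a pure set, larger when classes mix."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(values, labels):
    """Weighted Gini index of the subsets after splitting on an attribute;
    the best split is the one that minimizes this value."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        total += len(subset) / n * gini(subset)
    return total
```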

Decision tree learning has two main parts: building the tree, and pruning it to prevent overfitting.

Construction: using some evaluation criterion (Gini index, entropy, and so on), select the best attribute of the data set to split on. Each value of that attribute becomes a branch, so every sample on a branch shares the same attribute value, and each branch forms a new data set. Recurse, repeatedly selecting the optimal attribute, until the data set satisfies some stopping constraint (for example, it contains only one class); then attach the corresponding class label.
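The recursive construction just described can be sketched as a small ID3-style builder. This is a hedged, self-contained illustration (rows as dicts, the tree as nested dicts, "pure subset or no attributes left" as the only stopping rule); it is not a production implementation:

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    cond = 0.0
    for v in set(r[attr] for r in rows):
        subset = [label for r, label in zip(rows, labels) if r[attr] == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    """Recursively split on the best attribute until a subset is pure
    (one class) or no attributes remain (fall back to the majority class)."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node[best][v] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return node
```

Each branch of the returned dict corresponds to one attribute value, and leaves are class labels, matching the construction process above.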

Pruning: divided into pre-pruning (using restrictive constraints during construction to terminate tree growth early) and post-pruning (building the full tree, then cutting off relatively unimportant branches). Several methods:

1) REP (Reduced-Error Pruning)

2) PEP (Pessimistic Error Pruning)

This is the pruning used in the C4.5 algorithm; it is a post-pruning algorithm.

3) CCP (Cost-Complexity Pruning)
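For CCP, each internal node has a critical alpha at which collapsing its subtree into a leaf becomes worthwhile under the penalized cost R_alpha(T) = R(T) + alpha * |leaves(T)|; the node with the smallest such alpha (the "weakest link") is pruned first. A minimal sketch of that formula (the function name and argument conventions are my own):

```python
def critical_alpha(leaf_error, subtree_error, n_leaves):
    """Weakest-link value for cost-complexity pruning:
    alpha = (R(t) - R(T_t)) / (|leaves(T_t)| - 1),
    where R(t) is the error if node t were collapsed into a single leaf and
    R(T_t) is the total error of the leaves of its subtree T_t."""
    return (leaf_error - subtree_error) / (n_leaves - 1)
```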

https://chowdera.com/2021/01/20210123201637311l.html