Decision tree: ID3, C4.5, CART

2021-01-23 20:18:04 Light as breeze

Three split evaluation criteria

All of these decision tree algorithms work on a given set S. Suppose we split on some feature A: we pick the best partition value of A and obtain subsets S1, S2, ..., and the probability of a data point falling into each subset is P1, P2, .... We can then use an evaluation metric such as information gain, information gain ratio, or the Gini index to compute the cumulative value of the split over these subsets.

In this way, every candidate feature receives a cumulative evaluation value for its split, and the best-scoring feature is selected as the optimal split of the set S.
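To make the procedure concrete, here is a minimal Python sketch of this generic loop. The data layout (a list of (feature_dict, label) pairs), the discrete feature values, and the function names are my own assumptions for illustration, not code from the original post; the concrete criterion functions (entropy, Gini index) are sketched in the sections below.

from collections import defaultdict

def partition(S, feature):
    """Split S into subsets S1, S2, ... according to the value of `feature`."""
    subsets = defaultdict(list)
    for x, y in S:
        subsets[x[feature]].append((x, y))
    return list(subsets.values())

def split_score(S, feature, criterion):
    """Weighted criterion value over the subsets: sum_i P_i * criterion(S_i)."""
    n = len(S)
    return sum(len(Si) / n * criterion([y for _, y in Si])
               for Si in partition(S, feature))

def best_feature(S, features, criterion):
    """Pick the feature whose split has the lowest weighted impurity."""
    return min(features, key=lambda f: split_score(S, f, criterion))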

Information gain (ID3)

ID3 splits on the attribute with the maximum information gain (the larger the information gain, the better the attribute distinguishes the samples and the more representative it is; this is clearly a top-down greedy strategy).
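For reference, a minimal Python sketch of entropy and information gain, using the standard textbook definitions (the function names are mine, and partition() is reused from the sketch above):

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i), over the class proportions p_i of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(S, feature):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(S)
    remainder = sum(len(Si) / n * entropy([y for _, y in Si])
                    for Si in partition(S, feature))
    return entropy([y for _, y in S]) - remainder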

Two disadvantages of ID3:

(1). Information gain is biased toward attributes that produce more branches, i.e. attributes with many distinct values.

(2). It cannot handle continuous attributes.

Information gain ratio (C4.5)

C4.5 splits on the attribute with the maximum information gain ratio (again, the larger the value, the better the attribute distinguishes the samples and the more representative it is; this is still a top-down greedy strategy).
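A minimal sketch of the gain ratio built on the helpers above (standard definitions; split_information is my name for the attribute's intrinsic split information):

def split_information(S, feature):
    """SplitInfo(S, A) = -sum_v |S_v|/|S| * log2(|S_v|/|S|)."""
    n = len(S)
    return -sum(len(Si) / n * math.log2(len(Si) / n)
                for Si in partition(S, feature))

def gain_ratio(S, feature):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    si = split_information(S, feature)
    return information_gain(S, feature) / si if si > 0 else 0.0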

C4.5 overcomes the two shortcomings of ID3:

(1). Information gain's bias toward attributes that produce more branches, i.e. attributes with many distinct values.

(2). The inability to handle continuous attributes.

Its disadvantage: during tree construction, the data set has to be scanned and sorted repeatedly, which makes the algorithm inefficient.

The information gain ratio penalizes attributes with many values through the attribute's split information, which solves information gain's tendency to select many-valued attributes. However, it can over-penalize and end up selecting attributes that carry little information (an attribute may have a small information gain, but if its split information is also small, its gain ratio can still come out large).

The improvement to the information gain ratio:

First compute the information gain of every attribute, keep only those attributes whose gain is above the average, and then compare the gain ratios of just these attributes.
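A short sketch of this heuristic, reusing the functions defined above:

def c45_choose_feature(S, features):
    """Keep attributes with above-average information gain,
    then pick the one with the highest gain ratio."""
    gains = {f: information_gain(S, f) for f in features}
    avg = sum(gains.values()) / len(gains)
    candidates = [f for f in features if gains[f] >= avg]
    return max(candidates, key=lambda f: gain_ratio(S, f))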

Gini index (CART)

The Gini index measures the impurity of the subset corresponding to one value of an attribute. The smaller the Gini index, the more that subset is concentrated in a single class; the larger it is, the more mixed the classes are.

The corresponding Gini gain of a split is the weighted sum of the Gini indexes of the subsets it produces (weighted by subset size).

When choosing the best attribute to split on, we look for the attribute with the smallest GINI_Gain.
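A minimal sketch of the Gini index and the weighted Gini gain of a split (standard definitions; partition() comes from the first sketch):

from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2, over the class proportions p_i of S."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(S, feature):
    """GiniGain(S, A) = sum_v |S_v|/|S| * Gini(S_v); smaller is better."""
    n = len(S)
    return sum(len(Si) / n * gini([y for _, y in Si])
               for Si in partition(S, feature))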

Decision tree learning consists of two main parts: building the tree, and pruning it to prevent overfitting.

Construction works as follows: based on some evaluation criterion (Gini index, entropy, and so on), select the best attribute of the data set to split on; create one branch per value of that attribute (so the attribute takes a single value on each branch); each branch then forms a new data set, and we recurse to find the optimal split of each one, stopping when a data set satisfies some stopping constraint (or contains only one class), at which point the node is given the corresponding class label.
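A compact Python sketch of this recursive construction. The dict-based node layout and the stopping constraints max_depth and min_samples are illustrative choices of mine; best_feature() and partition() come from the first sketch.

from collections import Counter

def majority_class(S):
    return Counter(y for _, y in S).most_common(1)[0][0]

def build_tree(S, features, criterion, max_depth=10, min_samples=2, depth=0):
    labels = {y for _, y in S}
    # Stopping constraints: pure node, no features left, or size/depth limits.
    if len(labels) == 1 or not features or depth >= max_depth or len(S) < min_samples:
        return {"leaf": True, "label": majority_class(S)}
    f = best_feature(S, features, criterion)
    node = {"leaf": False, "feature": f, "children": {},
            "label": majority_class(S)}
    remaining = [g for g in features if g != f]
    for Si in partition(S, f):
        value = Si[0][0][f]  # every row in Si shares this value of f
        node["children"][value] = build_tree(Si, remaining, criterion,
                                             max_depth, min_samples, depth + 1)
    return node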

The pruning process is divided into pre-pruning (during construction, stopping constraints are used to terminate tree growth early) and post-pruning (build the full decision tree, then cut off relatively unimportant branches). A few common methods:

1) REP (Reduced-Error Pruning); a sketch is given after this list.

2) PEP (Pessimistic Error Pruning)

This is the pruning method used in the C4.5 algorithm; it is a post-pruning algorithm.

3) CCP (Cost-Complexity Pruning)
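As referenced in item 1), here is a minimal sketch of reduced-error pruning on the node layout used in the construction sketch above: working bottom-up, a subtree is replaced by a majority-class leaf whenever the leaf makes no more mistakes on the validation examples that reach it. The predict() helper and the routing of the validation set are my assumptions for illustration, not code from the original.

from collections import defaultdict

def predict(node, x):
    while not node["leaf"]:
        node = node["children"].get(x[node["feature"]])
        if node is None:  # unseen attribute value
            return None
    return node["label"]

def rep_prune(node, V):
    """Reduced-error pruning against a held-out validation set V."""
    if node["leaf"] or not V:
        return node
    # Route validation examples to the child branch they would follow.
    routed = defaultdict(list)
    for x, y in V:
        routed[x[node["feature"]]].append((x, y))
    for value, child in node["children"].items():
        node["children"][value] = rep_prune(child, routed.get(value, []))
    # Compare keeping the subtree vs. collapsing it into a leaf.
    subtree_errors = sum(1 for x, y in V if predict(node, x) != y)
    leaf_errors = sum(1 for _, y in V if y != node["label"])
    if leaf_errors <= subtree_errors:
        return {"leaf": True, "label": node["label"]}
    return node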

Copyright notice
This article was written by [Light as breeze]; please include a link to the original when reposting. Thanks.
https://chowdera.com/2021/01/20210123201637311l.html
