Decision tree construction
Building the decision tree with the ID3 algorithm
1. The ID3 algorithm

The core of ID3 is to apply the information gain criterion to select a feature at each node of the decision tree and to build the tree recursively. Concretely: starting from the root node, compute the information gain of every candidate feature for that node, choose the feature with the largest information gain as the splitting feature of the node, and create a child node for each of its values; then apply the same procedure recursively to each child node, until the information gain of every remaining feature is very small or no features are left to choose. The result is a decision tree. ID3 is equivalent to choosing a probabilistic model by maximum likelihood.
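As a brief recap of the criterion (the formulas themselves were derived in the previous article): for a data set D with K classes and a feature A whose values split D into subsets D1, ..., Dn,

H(D) = - sum_{k=1..K} p_k * log2(p_k)              (empirical entropy)
H(D | A) = sum_{i=1..n} (|Di| / |D|) * H(Di)       (empirical conditional entropy)
g(D, A) = H(D) - H(D | A)                          (information gain)

where p_k is the proportion of samples in D that belong to class k.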
Before building a decision tree with ID3, let's take another look at the data.

ID   Age           Has job   Owns house   Credit rating   Category (loan granted?)
1    youth         no        no           fair            no
2    youth         no        no           good            no
3    youth         yes       no           good            yes
4    youth         yes       yes          fair            yes
5    youth         no        no           fair            no
6    middle-aged   no        no           fair            no
7    middle-aged   no        no           good            no
8    middle-aged   yes       yes          good            yes
9    middle-aged   no        yes          excellent       yes
10   middle-aged   no        yes          excellent       yes
11   elderly       no        yes          excellent       yes
12   elderly       no        yes          good            yes
13   elderly       yes       no           good            yes
14   elderly       yes       no           excellent       yes
15   elderly       no        no           fair            no

Using the results from the previous article: since feature A3 (owns house) has the largest information gain, it is chosen as the feature of the root node. It splits the training set D into two subsets, D1 (the samples where A3 = "yes") and D2 (the samples where A3 = "no"). Because D1 contains only samples of a single class, it becomes a leaf node whose class is labeled "yes".
For D2, a new feature has to be selected from A1 (age), A2 (has job), and A4 (credit rating). Compute the information gain of each:

g(D2, A1) = H(D2) - H(D2 | A1) = 0.251
g(D2, A2) = H(D2) - H(D2 | A2) = 0.918
g(D2, A4) = H(D2) - H(D2 | A4) = 0.474
According to these calculations, feature A2 (has job), which has the largest information gain, is selected as the feature of this node. Since A2 has two possible values, two child nodes are derived from this node: the child corresponding to "yes" (has a job) contains 3 samples that all belong to the same class, so it becomes a leaf node labeled "yes"; the child corresponding to "no" (no job) contains 6 samples that also all belong to the same class, so it becomes a leaf node labeled "no".
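As a quick numerical sanity check of the three gains above, here is a small standalone snippet (not part of the article's code) that recomputes them from the class counts read off the table:

from math import log

def entropy(counts):
    # Shannon entropy of a class distribution given as a list of class counts
    total = float(sum(counts))
    return -sum((c / total) * log(c / total, 2) for c in counts if c > 0)

# D2 = the 9 samples without their own house: 3 "yes" and 6 "no"
H_D2 = entropy([3, 6])

# A1 (age): youth -> 1 yes / 3 no, middle-aged -> 2 no, elderly -> 2 yes / 1 no
H_D2_A1 = 4.0/9 * entropy([1, 3]) + 2.0/9 * entropy([2]) + 3.0/9 * entropy([2, 1])
# A2 (has job): has a job -> 3 yes, no job -> 6 no (both subsets are pure)
H_D2_A2 = 3.0/9 * entropy([3]) + 6.0/9 * entropy([6])
# A4 (credit rating): fair -> 4 no, good -> 2 yes / 2 no, excellent -> 1 yes
H_D2_A4 = 4.0/9 * entropy([4]) + 4.0/9 * entropy([2, 2]) + 1.0/9 * entropy([1])

print(round(H_D2 - H_D2_A1, 3))    # 0.252 (the 0.251 above rounds H(D2|A1) to 0.667 before subtracting)
print(round(H_D2 - H_D2_A2, 3))    # 0.918
print(round(H_D2 - H_D2_A4, 3))    # 0.474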
This yields a decision tree that uses only two features (it has two internal nodes); the generated decision tree is shown in the figure below.
[Figure: the decision tree generated by ID3, with internal nodes for "Owns house" and "Has job"]
We have now built the decision tree by hand using the ID3 algorithm. Next, let's look at how to implement the tree generation in code.

Write code to build a decision tree

We use a nested dictionary to store the structure of the decision tree. For example, the decision tree analyzed in the previous section can be expressed as:
{'Owns house': {0: {'Has job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
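To see how this nested dictionary encodes decisions, here is a minimal sketch of how a prediction could be read off by walking the dictionary; classify_with_dict is a hypothetical helper introduced only for illustration, and the feature order matches the labels list used in the code below:

def classify_with_dict(tree, feat_labels, feat_vec):
    node = tree
    while isinstance(node, dict):                    # internal nodes are dictionaries, leaves are class labels
        feat_name = list(node.keys())[0]             # the feature tested at this node
        feat_value = feat_vec[feat_labels.index(feat_name)]
        node = node[feat_name][feat_value]           # descend into the branch matching the sample's value
    return node

tree = {'Owns house': {0: {'Has job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
labels = ['Age', 'Has job', 'Owns house', 'Credit rating']
print(classify_with_dict(tree, labels, [0, 1, 0, 1]))    # a young applicant with a job but no house -> yes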
We create a function majorityCnt to find the most frequent element (class label) in classList, and a function createTree to build the decision tree recursively. The code is as follows:

# -*- coding: UTF-8 -*-
from math import log
import operator

"""
 Function description : Calculate the empirical entropy of a given data set ( Shannon entropy )

Parameters:
    dataSet -  Data sets 
Returns:
    shannonEnt -  Experience in entropy ( Shannon entropy )
"""
def calcShannonEnt(dataSet):
    numEntires = len(dataSet)                        # number of rows in the data set
    labelCounts = {}                                 # dictionary that counts how many times each label appears
    for featVec in dataSet:                          # iterate over every feature vector
        currentLabel = featVec[-1]                   # extract the label information
        if currentLabel not in labelCounts.keys():   # if the label is not yet in the dictionary, add it
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1               # count the label
    shannonEnt = 0.0                                 # empirical entropy (Shannon entropy)
    for key in labelCounts:                          # compute the Shannon entropy
        prob = float(labelCounts[key]) / numEntires  # probability of this label
        shannonEnt -= prob * log(prob, 2)            # accumulate using the entropy formula
    return shannonEnt                                # return the empirical entropy (Shannon entropy)

"""
 Function description : Create test data sets 

Parameters:
     nothing 
Returns:
    dataSet -  Data sets 
    labels -  Feature tags 
"""
def createDataSet():
    dataSet = [[0, 0, 0, 0, 'no'],                        # Data sets 
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    labels = ['Age', 'Has job', 'Owns house', 'Credit rating']        # feature labels
    return dataSet, labels                                            # return the data set and the feature labels

"""
 Function description : Divide the data set according to the given characteristics 

Parameters:
    dataSet -  Data set to be partitioned 
    axis -  Partition the characteristics of the data set 
    value -  The value of the characteristic to be returned 
Returns:
     nothing 
"""
def splitDataSet(dataSet, axis, value):
    retDataSet = []                                     # list that will hold the split data set
    for featVec in dataSet:                             # iterate over the data set
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]             # remove the axis-th feature
            reducedFeatVec.extend(featVec[axis+1:])     # keep the rest of the matching sample
            retDataSet.append(reducedFeatVec)
    return retDataSet                                   # return the split data set

"""
 Function description : Select the best feature 

Parameters:
    dataSet -  Data sets 
Returns:
    bestFeature -  The biggest information gain ( The optimal ) Index value of the feature 
"""
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1                    # number of features
    baseEntropy = calcShannonEnt(dataSet)                # Shannon entropy of the whole data set
    bestInfoGain = 0.0                                   # largest information gain found so far
    bestFeature = -1                                     # index of the optimal feature
    for i in range(numFeatures):                         # iterate over all features
        # collect the i-th feature of every sample in dataSet
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)                       # build a set of the unique values of this feature
        newEntropy = 0.0                                 # empirical conditional entropy
        for value in uniqueVals:                         # compute the empirical conditional entropy
            subDataSet = splitDataSet(dataSet, i, value)         # subset of samples whose i-th feature equals value
            prob = len(subDataSet) / float(len(dataSet))         # probability (weight) of the subset
            newEntropy += prob * calcShannonEnt(subDataSet)      # accumulate the empirical conditional entropy
        infoGain = baseEntropy - newEntropy                      # information gain of feature i
        # print("The information gain of feature %d is %.3f" % (i, infoGain))    # print the information gain of each feature
        if (infoGain > bestInfoGain):                            # keep the feature with the largest information gain
            bestInfoGain = infoGain                              # update the best information gain
            bestFeature = i                                      # record the index of the feature with the largest gain
    return bestFeature                                           # return the index of the feature with the largest information gain


"""
 Function description : Statistics classList Most of the elements that appear here in ( Class label )

Parameters:
    classList -  Class tag list 
Returns:
    sortedClassCount[0][0] -  The elements that appear most here ( Class label )
"""
def majorityCnt(classList):
    classCount = {}
    for vote in classList:                                        # count the occurrences of each element in classList
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)        # sort by count in descending order
    return sortedClassCount[0][0]                                # return the most frequent element in classList

"""
 Function description : Create a decision tree 

Parameters:
    dataSet -  Training data set 
    labels -  Category attribute tag 
    featLabels -  Store the selected optimal feature tag 
"""
def createTree(dataSet, labels, featLabels):
    classList = [example[-1] for example in dataSet]            # get the class labels (grant loan: yes or no)
    if classList.count(classList[0]) == len(classList):        # stop splitting if all samples belong to the same class
        return classList[0]
    if len(dataSet[0]) == 1:                                    # if all features have been used, return the most frequent class label
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)                # choose the optimal feature
    bestFeatLabel = labels[bestFeat]                            # label of the optimal feature
    featLabels.append(bestFeatLabel)
    myTree = {bestFeatLabel:{}}                                 # build the tree keyed by the label of the optimal feature
    del(labels[bestFeat])                                       # remove the used feature label
    featValues = [example[bestFeat] for example in dataSet]     # get all values of the optimal feature in the training set
    uniqueVals = set(featValues)                                # remove duplicate values
    for value in uniqueVals:                                    # iterate over the feature values and build each subtree recursively
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), labels, featLabels)
    return myTree

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    print(myTree)
{'Owns house': {0: {'Has job': {0: 'no', 1: 'yes'}}, 1: 'yes'}}

When building the decision tree recursively, the recursion has two termination conditions. The first is that all class labels in the current subset are identical, in which case that label is returned directly. The second is that all features have been used but the data still cannot be divided into groups that each contain only a single class; this means the features (the dimensionality of the data) are not sufficient. Since the second condition cannot simply return a unique class label, the class that appears most often in the subset is chosen as the return value.
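For example, with the majorityCnt function defined above, a subset that can no longer be split and whose labels are ['yes', 'no', 'yes'] becomes a leaf labeled with the majority class:

print(majorityCnt(['yes', 'no', 'yes']))    # prints yes, the class that appears most often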