# Classification Function Showdown: When to Use Sigmoid vs. Softmax

2020-11-08 16:17:51

When designing a model for a classification task (such as diagnosing diseases from a chest X-ray or recognizing handwritten digits), sometimes several answers must be selected at once (for example, both pneumonia and abscess), and sometimes only one answer is allowed (for example, the digit "8"). This article discusses how to apply the Sigmoid function or the Softmax function to the raw output values of a classifier.

## Neural network classifiers

There are many kinds of classification algorithms, but this article is limited to neural network classifiers. Classification problems can be solved by different kinds of neural networks, such as feedforward neural networks and convolutional neural networks.

## Applying Sigmoid or Softmax

The final layer of a neural network classifier produces a vector of "raw output values", such as [-0.5, 1.2, -0.1, 2.4]. Suppose these four values correspond to pneumonia, cardiomegaly, tumor, and abscess in a chest X-ray. But what do these raw output values mean? They are easier to interpret once converted to probabilities: compared with the seemingly arbitrary "2.4", the statement "the probability of an abscess is 91%" is much easier for a patient to understand. Both the Sigmoid function and the Softmax function map a classifier's raw output values to probabilities.

The accompanying figures show the raw outputs of a feedforward neural network (blue) being mapped to probabilities (red), first with the Sigmoid function and then with the Softmax function. The two functions give different results. Sigmoid processes each raw output value separately, so its results are independent of one another and the probabilities do not necessarily sum to 1: here, 0.37 + 0.77 + 0.48 + 0.91 = 2.53. Softmax outputs, by contrast, are interrelated and always sum to 1: here, 0.04 + 0.21 + 0.05 + 0.70 = 1.00. With Softmax, therefore, raising the probability of one class necessarily lowers the probability of the others.
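The comparison above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function and variable names are chosen here for illustration and do not come from the article.

```python
import math

def sigmoid(z):
    """Map one raw output value to a probability, independently of the others."""
    return math.exp(z) / (1.0 + math.exp(z))

def softmax(values):
    """Map a list of raw output values to probabilities that sum to 1."""
    exps = [math.exp(z) for z in values]
    total = sum(exps)  # the shared denominator couples all of the outputs
    return [e / total for e in exps]

# Raw outputs from the article's example: pneumonia, cardiomegaly, tumor, abscess
raw_outputs = [-0.5, 1.2, -0.1, 2.4]

sigmoid_probs = [sigmoid(z) for z in raw_outputs]
softmax_probs = softmax(raw_outputs)

print("sigmoid:", [round(p, 2) for p in sigmoid_probs], "sum =", round(sum(sigmoid_probs), 2))
print("softmax:", [round(p, 2) for p in softmax_probs], "sum =", round(sum(softmax_probs), 2))
```

Run as written, the Sigmoid probabilities sum to about 2.54 and the Softmax probabilities sum to 1. Each individual value agrees with the figures quoted above to within 0.01; the small differences in the last digit suggest that the quoted values were truncated rather than rounded in places.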

## Sigmoid applications: chest X-rays and hospital admission

Chest X-rays: A single chest X-ray can show several diseases at once, so a chest X-ray classifier must be able to report several findings at the same time. For example, for a chest X-ray showing both pneumonia and abscess, the label column contains two "1"s.

Hospital admission: The goal is to determine, from a patient's health record, how likely the patient is to be admitted in the future. The classification problem can be framed as follows: classify the patient's existing health record by the diagnoses (if any) that are likely to lead to a future admission. Several diseases may lead to admission, so there may be more than one correct answer.

Diagrams: Two feedforward neural networks correspond to the two problems above. In the final step, the Sigmoid function converts the raw output values into probabilities, which allows several possibilities to coexist: a chest X-ray may show several abnormalities, and an admission may have more than one cause.

## Softmax applications: handwritten digits and Iris

Handwritten digits: When classifying handwritten digits (the MNIST dataset: https://en.wikipedia.org/wiki/MNIST_database), the classifier should use the Softmax function to decide which digit an image shows. After all, a digit 8 is only a digit 8; it cannot be a 7 at the same time.

Iris: The Iris dataset was introduced in 1936 (https://en.wikipedia.org/wiki/Iris_flower_data_set). It contains 150 samples divided into 3 classes, Iris setosa, Iris versicolor, and Iris virginica, with 50 samples in each class. Each sample has 4 attributes: sepal length, sepal width, petal length, and petal width.
[Table: nine example rows from the Iris dataset]

The dataset contains no images, but here is a picture of Iris versicolor for you to enjoy: https://en.wikipedia.org/wiki/Iris_flower_data_set#/media/File:Iris_versicolor_3.jpg

A neural network classifier for the Iris dataset should apply the Softmax function to its raw output values, because an iris belongs to exactly one species; assigning it to several species makes no sense.
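The two kinds of application described above lead to different decision rules once the probabilities are available. The sketch below is illustrative only: the helper names are hypothetical, the 0.5 threshold is an assumption that the article does not specify, and the Iris probabilities are made-up values rather than real model output.

```python
def pick_multi_label(probs, labels, threshold=0.5):
    """Multi-label (Sigmoid): keep every class whose probability clears the threshold."""
    return [label for label, p in zip(labels, probs) if p >= threshold]

def pick_single_label(probs, labels):
    """Multi-class (Softmax): keep only the single most probable class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]

# Sigmoid probabilities quoted in the article for the chest X-ray example
xray_labels = ["pneumonia", "cardiomegaly", "tumor", "abscess"]
xray_sigmoid_probs = [0.37, 0.77, 0.48, 0.91]

# Hypothetical Softmax probabilities for one iris sample (illustration only)
iris_labels = ["setosa", "versicolor", "virginica"]
iris_softmax_probs = [0.10, 0.75, 0.15]

print(pick_multi_label(xray_sigmoid_probs, xray_labels))   # ['cardiomegaly', 'abscess']
print(pick_single_label(iris_softmax_probs, iris_labels))  # versicolor
```

With Sigmoid outputs, any number of classes (including none) can be selected; with Softmax outputs, exactly one class is selected.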
## About "e"

To understand the Sigmoid and Softmax functions, we first need to introduce "e". For this article, it is enough to know that e is a mathematical constant approximately equal to 2.71828. Some more facts about e:

• The decimal expansion of e never ends, and its digits appear completely random, much like pi.
• e is often used in the study of compound interest, gambling, and certain probability distributions.
• One formula for e is $e = \sum_{n=0}^{\infty} \frac{1}{n!}$, but there is more than one formula, and there are many ways to calculate e. See, for example: https://www.intmath.com/exponential-logarithmic-functions/calculating-e.php
• In 2004, Google's IPO filing announced a plan to raise 2,718,281,828 dollars, that is, "e billion dollars".
• Wikipedia charts the history of the number of known decimal digits of e (https://en.wikipedia.org/wiki/E_%28mathematical_constant%29#Bernoulli_trials), from 1 digit in 1690 to 116,000 digits in 1978.

## The Sigmoid and Softmax functions

Sigmoid = multi-label classification = multiple correct answers = non-exclusive outputs (for example, chest X-rays and hospital admission)

• When building a classifier for a problem that has more than one correct answer, apply the Sigmoid function to each raw output value separately.
• The Sigmoid function is written as follows (note the e):

$$\sigma(z_j) = \frac{e^{z_j}}{1 + e^{z_j}}$$

In this formula, σ denotes the Sigmoid function, and σ(z_j) means applying the Sigmoid function to the number z_j. Here z_j is a single raw output value, such as -0.5, and j indexes the output value currently being processed. With four raw output values, j = 1, 2, 3, or 4. In the earlier example the raw output values are [-0.5, 1.2, -0.1, 2.4], so z_1 = -0.5, z_2 = 1.2, z_3 = -0.1, and z_4 = 2.4. The formula is applied to z_1 on its own, and z_2, z_3, and z_4 are handled in exactly the same way.
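As a worked example of the Sigmoid calculation, here is the arithmetic for the first and last raw output values from the earlier example. The intermediate decimals are rounded for readability.

$$\sigma(z_1) = \frac{e^{z_1}}{1 + e^{z_1}} = \frac{e^{-0.5}}{1 + e^{-0.5}} \approx \frac{0.6065}{1.6065} \approx 0.378$$

$$\sigma(z_4) = \frac{e^{z_4}}{1 + e^{z_4}} = \frac{e^{2.4}}{1 + e^{2.4}} \approx \frac{11.0232}{12.0232} \approx 0.917$$

These correspond to the 0.37 and 0.91 quoted earlier for the same inputs, up to the last digit. Dividing numerator and denominator by $e^{z_j}$ gives the equivalent form $\sigma(z_j) = \frac{1}{1 + e^{-z_j}}$, which is how the function is often written.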
Because the Sigmoid function is applied to each raw output value separately, the possible outcomes include: every class has a low probability (for example, "this chest X-ray shows nothing abnormal"); one class has a high probability while the rest are low ("this chest X-ray shows only pneumonia"); or several or all classes have a high probability ("this chest X-ray shows pneumonia and abscess"). The Sigmoid curve is S-shaped and maps any real number to a value between 0 and 1.

Softmax = multi-class classification = only one correct answer = mutually exclusive outputs (for example, handwritten digits and Iris)

• When building a classifier for a problem with only one correct answer, apply the Softmax function to the raw output values.
• The denominator of the Softmax function combines all of the raw output values, which means that the probabilities produced by Softmax are related to one another.
• The Softmax function is written as follows, where K is the number of raw output values:

$$\sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

Apart from the denominator, which sums e raised to each of the raw output values, the Softmax function differs little from the Sigmoid function. In other words, when Softmax is computed for a single raw output value (say z_1), z_1 alone is not enough: z_1, z_2, z_3, and z_4 all enter the denominator:

$$\sigma(z_1) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$

The advantage of the Softmax function is that the output probabilities always sum to 1. When classifying handwritten digits with Softmax, increasing the probability that an example is an "8" necessarily decreases the probability that it is any other digit (0, 1, 2, 3, 4, 5, 6, 7, and/or 9).

## Summary
• If the model's output classes are not mutually exclusive and several can be selected at once, apply the Sigmoid function to the network's raw output values.
• If the model's output classes are mutually exclusive and only one can be selected, apply the Softmax function to the network's raw output values.
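To illustrate the difference summarized above, the following sketch raises a single raw output value and checks how the other probabilities react under each function. The change from 2.4 to 3.4 is a hypothetical perturbation chosen for illustration, and subtracting the maximum inside `softmax` is a common numerical-stability step that the article does not discuss; it leaves the result unchanged.

```python
import math

def sigmoid(z):
    """Sigmoid in its equivalent form 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(values):
    """Softmax with the maximum subtracted first to avoid overflow in exp()."""
    shift = max(values)
    exps = [math.exp(z - shift) for z in values]
    total = sum(exps)
    return [e / total for e in exps]

before = [-0.5, 1.2, -0.1, 2.4]  # raw outputs from the article's example
after = [-0.5, 1.2, -0.1, 3.4]   # only the last raw output is raised (hypothetical)

sig_before = [sigmoid(z) for z in before]
sig_after = [sigmoid(z) for z in after]
soft_before = softmax(before)
soft_after = softmax(after)

# Sigmoid: the first three probabilities do not move at all.
print("sigmoid unchanged:", sig_before[:3] == sig_after[:3])

# Softmax: the first three probabilities all shrink to make room for the fourth.
print("softmax decreased:", all(a < b for a, b in zip(soft_after[:3], soft_before[:3])))
```

Both checks print `True`: under Sigmoid the other classes are unaffected, while under Softmax their probabilities must fall so that the total stays at 1.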