Source: Analytics Vidhya
You've successfully built a classification model. What now? How do you evaluate the model's performance, that is, how well it predicts outcomes? To answer these questions, let's use a simple case study to understand the metrics used in evaluating classification models.
Let's take a deeper look at the concepts through a case study
In this era of globalization, people frequently travel from one place to another. Airports carry a risk of transmission because passengers wait in lines, check in, visit food vendors, and use shared facilities such as restrooms. Identifying passengers carrying the virus at the airport helps prevent it from spreading.
Imagine we have a machine learning model that classifies passengers as COVID positive or COVID negative. A classification prediction has four possible types of outcome:
True positive (TP): you predict that an observation belongs to a class, and it actually does. Here, that means a passenger predicted COVID positive who is actually positive.
True negative (TN): you predict that an observation does not belong to a class, and it indeed does not. Here, a passenger predicted COVID negative who is actually negative.
False positive (FP): you predict that an observation belongs to a class, but it does not. Here, a passenger predicted COVID positive who is actually negative.
False negative (FN): you predict that an observation does not belong to a class, but it actually does. Here, a passenger predicted COVID negative who is actually positive.
To better visualize the model's performance, these four outcomes are arranged in a confusion matrix.
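As a minimal sketch, the four cells of the confusion matrix can be tallied directly from a pair of label lists. The labels below are made up for illustration, with 1 meaning COVID positive and 0 meaning negative:

```python
def confusion_counts(y_true, y_pred):
    # Tally each of the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Six hypothetical passengers: actual status vs. model prediction.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```

In practice you would use a library helper such as `sklearn.metrics.confusion_matrix`, but the hand-rolled version makes the four definitions concrete.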
Yes, you're right: we want our model to produce mostly true positives and true negatives. Accuracy is the metric that gives the fraction of predictions our model got right. Formally:

Accuracy = number of correct predictions / total number of predictions
Now, let's say that on average 50,000 passengers travel through the airport, of whom 10 are COVID positive.
An easy way to boost accuracy is to classify every passenger as COVID negative. The confusion matrix then looks like this:
The accuracy in this case is:

Accuracy = 49,990 / 50,000 = 0.9998, or 99.98%
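The arithmetic is easy to verify in a few lines; the numbers come straight from the example above:

```python
# 50,000 passengers, 10 of whom are COVID positive, and a "model"
# that labels everyone negative.
total = 50_000
positives = 10

# Every negative passenger is "correctly" predicted; the 10 positives are missed.
correct = total - positives
accuracy = correct / total
print(accuracy)  # 0.9998
```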
Amazing, right? But does this actually serve the purpose of correctly identifying COVID-positive passengers?
For this particular example, where we are trying to label passengers as COVID positive or negative in the hope of identifying the right ones, we could simply label everyone as COVID negative and still get 99.98% accuracy.
That is higher accuracy than almost any real model would achieve, yet it does not serve the purpose. The purpose here is to identify the COVID-positive passengers. In this situation, accuracy is a terrible metric, because it is easy to score very well on it without doing what we actually care about.
So in this case, accuracy is not a good way to evaluate the model. Let's look at a very popular metric called recall.
Recall (sensitivity, or true positive rate)
Recall gives the fraction of actual positives that you correctly identified as positive.
This is an important metric: of all the truly positive passengers, what fraction did you correctly identify? Going back to our earlier strategy of labeling every passenger negative, the recall is:

Recall = 0/10 = 0
So in this case, recall is a good measure: it exposes the terrible "label every passenger negative" strategy with a score of zero. We want to maximize recall.
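Recall depends only on the true positive and false negative counts, so it can be sketched as a two-argument function. The counts below are the ones used in the article's two strategies:

```python
def recall(tp, fn):
    # Recall = TP / (TP + FN): of all actual positives, the share we caught.
    # Guard against a zero denominator (no actual positives at all).
    return tp / (tp + fn) if (tp + fn) else 0.0

# "Label everyone negative": all 10 positives become false negatives.
print(recall(tp=0, fn=10))   # 0.0
# "Label everyone positive": all 10 positives are caught.
print(recall(tp=10, fn=0))   # 1.0
```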
Now consider the opposite strategy: label every passenger as COVID positive. Everyone entering the airport gets a positive label from the model. Putting a positive label on every passenger is not good either, because the real-world cost of screening every single passenger before boarding is enormous.
The confusion matrix is now:
The recall would be:

Recall = 10/(10+0) = 1
Here is the problem. We concluded that accuracy was a bad idea, because labeling everyone negative inflates it; we hoped recall would be a good measure in this case, only to realize that labeling everyone positive inflates recall in exactly the same way.
So recall on its own is not a good measure either.
Another metric is called precision
Precision gives the fraction of all predicted positives that actually are positive.
For our second flawed strategy, labeling every passenger as positive, the precision would be:

Precision = 10 / (10 + 49990) = 0.0002
Although this flawed strategy has a perfect recall of 1, it has a terrible precision of 0.0002.
This shows that recall alone is not a good measure; we need to consider precision as well.
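Precision can be sketched the same way, using the counts from the "label everyone positive" strategy above:

```python
def precision(tp, fp):
    # Precision = TP / (TP + FP): of all predicted positives, the share
    # that really are positive. Guard against a zero denominator
    # (no positive predictions at all).
    return tp / (tp + fp) if (tp + fp) else 0.0

# "Label everyone positive": 10 true positives, 49,990 false positives.
print(precision(tp=10, fp=49_990))   # 0.0002
```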
Consider one more scenario (this will be the last one, I promise :P): label only the top passenger as COVID positive, that is, the passenger the model considers most likely to have COVID. Suppose we pick exactly one such passenger. The confusion matrix in this case is:
The precision is:

Precision = 1/(1+0) = 1

The precision is perfect, but let's check the recall:

Recall = 1 / (1 + 9) = 0.1

So in this case the precision is very good, but the recall is very low.
| Strategy | Accuracy | Recall | Precision |
| --- | --- | --- | --- |
| Classify all passengers as negative | high | low | low |
| Classify all passengers as positive | low | high | low |
| Label only the top passenger as positive | high | low | high |
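The table can be reproduced from the confusion-matrix counts used throughout the article (50,000 passengers, 10 of them positive):

```python
def metrics(tp, tn, fp, fn):
    # Compute accuracy, recall, and precision from the four cell counts,
    # guarding the two ratios against zero denominators.
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    return acc, rec, prec

strategies = {
    "all negative":  (0, 49_990, 0, 10),       # tp, tn, fp, fn
    "all positive":  (10, 0, 49_990, 0),
    "top passenger": (1, 49_990, 0, 9),
}
for name, counts in strategies.items():
    acc, rec, prec = metrics(*counts)
    print(f"{name}: accuracy={acc:.4f} recall={rec:.4f} precision={prec:.4f}")
```

Each strategy maxes out one metric while collapsing another, which is exactly the pattern in the table.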
In some cases, we are quite sure we want to maximize either recall or precision at the expense of the other. In this passenger-screening case, we really want to catch every COVID-positive passenger: missing one is very costly, because letting a COVID-positive person through can increase transmission. So here we care more about recall.
Unfortunately, you can't have both: improving precision reduces recall, and vice versa. This is called the precision/recall trade-off.
The precision/recall trade-off
Some classification models output a probability between 0 and 1. When classifying passengers as COVID positive or negative, we want to avoid missing actual positive cases. In particular, if a passenger is truly positive but our model fails to flag them, that is very bad, because letting those passengers board could spread the virus. So even when there is only a little suspicion of COVID, we should assign a positive label.
So our strategy is: if the output probability is greater than 0.3, we label the passenger as COVID positive.
This leads to higher recall and lower precision.
Consider the opposite: we only want to classify a passenger as positive when we are sure. We set the probability threshold to 0.9: when the probability is 0.9 or higher, the passenger is classified as positive; otherwise, negative.
So in general, for most classifiers, changing the probability threshold trades recall against precision.
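A minimal sketch of the trade-off, using made-up scores for eight passengers: the 0.3 and 0.9 thresholds are the same ones discussed above.

```python
# Hypothetical actual labels and model probabilities for eight passengers.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.95]

def prec_rec(threshold):
    # Turn probabilities into labels at the given threshold, then
    # compute precision and recall from the resulting counts.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

print(prec_rec(0.3))  # low threshold: perfect recall, lower precision
print(prec_rec(0.9))  # high threshold: perfect precision, lower recall
```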
If you need to compare models with different precision and recall values, it is often convenient to combine precision and recall into a single metric. Right! We need a metric that accounts for both recall and precision.
The F1 score is defined as the harmonic mean of the model's precision and recall.
You may wonder why we use the harmonic mean rather than the simple average. Unlike the simple average, the harmonic mean is dominated by the smaller of the two values, so one very large value cannot mask a very small one.
For example, a model with a precision of 1 and a recall of 0 has a simple average of 0.5 but an F1 score of 0. If either value is low, the other value barely matters to the F1 score. The F1 score favors classifiers whose precision and recall are similar.
So if you want to strike a balance between precision and recall, the F1 score is a better measure.
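A quick numeric check of the degenerate example above (precision 1, recall 0):

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall; it is 0 whenever
    # either input is 0 (guarded here to avoid division by zero).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print((1 + 0) / 2)   # simple average: 0.5
print(f1(1, 0))      # F1 score: 0.0
print(f1(0.8, 0.8))  # balanced precision and recall score well
```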
The ROC curve is another common evaluation tool. It shows the model's sensitivity and specificity at every possible decision threshold between 0 and 1. For classification problems with probabilistic outputs, a threshold converts the output probabilities into class labels, so changing the threshold changes the numbers in the confusion matrix. The most important question here is: how do you find the right threshold?
For every possible threshold, the ROC curve plots the false positive rate against the true positive rate.

False positive rate: the proportion of actual negatives wrongly classified as positive.

True positive rate: the proportion of actual positives correctly classified as positive.
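From these two definitions, the rates at a single decision threshold follow directly from the confusion-matrix cells. The counts below are made up for illustration:

```python
def tpr_fpr(tp, tn, fp, fn):
    # True positive rate (recall): caught positives / all actual positives.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # False positive rate: false alarms / all actual negatives.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Hypothetical counts at one threshold: 10 actual positives, 90 negatives.
print(tpr_fpr(tp=8, tn=85, fp=5, fn=2))
```

Sweeping the threshold and plotting each (FPR, TPR) pair traces out the ROC curve.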
Now consider a low threshold, say 0.1: with all the probabilities in ascending order, everything below 0.1 is classified negative and everything above 0.1 positive. You are free to choose the threshold; you could just as well set it high, say 0.9.
The following is the ROC curve for the same model under different thresholds.
As the figure shows, the true positive rate climbs quickly at first, but beyond some threshold the gains flatten out: each further increase in TPR comes at the cost of an increase in FPR. In the initial stage, TPR grows faster than FPR.
We can therefore choose a threshold where TPR is high and FPR is low.
Now let's see what different values of TPR and FPR tell us about the model.
Different models produce different ROC curves. So how do we compare models? As the figure above suggests, the higher curve corresponds to the better model. One way to compare classifiers is to measure the area under the ROC curve (AUC).
AUC(Model 1) > AUC(Model 2) > AUC(Model 3)
So Model 1 is the best.
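One way to compute AUC without plotting anything is its rank interpretation: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small sketch with made-up scores for two hypothetical models:

```python
def auc(y_true, scores):
    # AUC via the rank interpretation: the fraction of positive/negative
    # pairs where the positive is scored higher (ties count as half).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true  = [0, 0, 1, 1]
model_1 = [0.1, 0.3, 0.7, 0.9]   # ranks both positives above both negatives
model_2 = [0.1, 0.7, 0.3, 0.9]   # one positive is ranked below one negative
print(auc(y_true, model_1))  # 1.0
print(auc(y_true, model_2))  # 0.75
```

This pairwise loop is quadratic and only meant to make the metric concrete; for real work, `sklearn.metrics.roc_auc_score` computes the same quantity efficiently.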
We have covered the different metrics used to evaluate classification models. Which metric to use depends largely on the nature of the problem. So go back to your model, ask yourself what the main goal of your solution is, choose the right metric, and evaluate your model accordingly.
Original article: https://www.analyticsvidhya.c...