# Understanding object detection: R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD

2020-12-07 13:28:53

## Preface

The deep learning courses and other courses offered by my company, July Online, often cover object detection, including R-CNN, Fast R-CNN and Faster R-CNN, but I never had a good opportunity to dig into the topic (once you have a basic understanding of object detection, though, going back to those courses pays off handsomely). Still, the field is so hot, I kept running into well-written and easy-to-follow material, and I had also bought a book on JD.com and read through it, so one thing led to another and I finally started studying it.

On May 1 this year I was heading back to Beijing from Baoding. Afraid the highway would be jammed, I skipped the bus; I missed the high-speed train; so I had to take the clattering old green train I had not ridden in years, and it even ran late. At the station I set up a hotspot on my phone and worked on revising the question bank, and along the way I finally figured out the core differences between R-CNN, Fast R-CNN and Faster R-CNN. When you love what you do, nothing gets in the way.

## 1. Common object detection algorithms

Object detection means finding the exact location of each object in a given picture and labeling its category. In other words, object detection has to answer both where the object is and what it is.

This problem is not easy to solve, however: object sizes vary widely, the angle and pose of an object are uncertain, objects can appear anywhere in the picture, and they can belong to many different categories.

At present, the object detection algorithms used in academia and industry fall into 3 classes:
1. Traditional object detection algorithms: Cascade + HOG/DPM + Haar/SVM, and the many improvements and optimizations of these methods;

2. Candidate region/window + deep learning classification: extract candidate regions, then classify each region with a deep-learning-based method, such as:
• R-CNN (Selective Search + CNN + SVM)
• SPP-net (ROI Pooling)
• Fast R-CNN (Selective Search + CNN + ROI)
• Faster R-CNN (RPN + CNN + ROI)
• R-FCN

and so on;

3. Regression methods based on deep learning: YOLO/SSD/DenseBox and others, as well as the more recent RRC detection, which incorporates RNNs, and Deformable CNN, which incorporates DPM.

The traditional object detection pipeline has three stages:
1) Region selection (an exhaustive strategy: slide windows of different sizes and aspect ratios over the whole image; high time complexity)
2) Feature extraction (SIFT, HOG, etc.; the diversity of object shapes, lighting changes and backgrounds makes it hard to design robust features)
3) Classification (mainly SVM, Adaboost, etc.)

## 2. Traditional object detection algorithms

### 2.1 Starting from the image recognition task

Here is an image task: recognize the object in the picture and draw a box showing where it is.

This task essentially consists of two problems: first, image recognition; second, localization.

Image recognition (classification)
Input: picture
Output: category of the object
Evaluation method: accuracy

Localization
Input: picture
Output: position of the box in the picture (x, y, w, h)
Evaluation method: intersection-over-union (IoU) between the predicted box and the ground-truth box (for what IoU is, see question 55 in the deep learning category of the July Online APP question bank: https://www.julyedu.com/question/big/kp_id/26/ques_id/2138)
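To make the evaluation metric concrete, here is a minimal sketch of IoU for two boxes in the (x, y, w, h) format used above. The article does not say what (x, y) refers to, so the sketch assumes it is the top-left corner; the function name `iou` is just for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h); (x, y) is ASSUMED to be the top-left corner."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A prediction that coincides with the ground truth scores 1, a box that misses it entirely scores 0, and everything else lands in between.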

A convolutional neural network (CNN) already handles the image recognition task for us (deciding whether it is a cat or a dog); we only need to add some extra functionality to complete the localization task.

What are the ways to solve the localization problem?
Approach 1: treat it as a regression problem
Treated as a regression problem, we need to predict the values of the four parameters (x, y, w, h), which gives us the position of the box.

Step 1:
• Solve the simple problem first: build a neural network that recognizes images
• Fine-tune on AlexNet, VGG or GoogLeNet (for what fine-tuning is, see: https://www.julyedu.com/question/big/kp_id/26/ques_id/2137)

Step 2:
• Extend the tail of the network (the front of the CNN stays unchanged; we improve the end of the CNN by adding two heads: a "classification head" and a "regression head")
• The network becomes a classification + regression model

Step 3:
• The regression part uses a Euclidean distance loss
• Train with SGD

Step 4:
• At prediction time, attach both heads together
• Each head performs its own function

Fine-tuning needs to be done twice here:
the first time on AlexNet; the second time, swap the head for the regression head, keep the front unchanged, and fine-tune again.
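The two-head idea in the steps above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: random vectors stand in for the shared CNN features, each head is a single linear layer, and the names `two_head_forward` and `euclidean_loss` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_head_forward(features, w_cls, w_reg):
    """Shared features go to a classification head and a regression head."""
    logits = features @ w_cls
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    boxes = features @ w_reg                               # predicted (x, y, w, h)
    return probs, boxes

def euclidean_loss(pred_boxes, true_boxes):
    """Euclidean (L2) loss used for the regression head, averaged over the batch."""
    return 0.5 * np.mean(np.sum((pred_boxes - true_boxes) ** 2, axis=1))

# Stand-ins: 4 images, 16-dimensional shared features, 3 classes (all hypothetical).
features = rng.normal(size=(4, 16))
w_cls = rng.normal(size=(16, 3)) * 0.1
w_reg = rng.normal(size=(16, 4)) * 0.1

probs, boxes = two_head_forward(features, w_cls, w_reg)
```

At prediction time both outputs are read together: `probs` says what the object is and `boxes` says where it is.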

Where should the regression part be attached?
There are two options:
• After the last convolutional layer (as in VGG)
• After the last fully connected layer (as in R-CNN)

Regression is too hard to train, so we should do our best to convert the task into a classification problem.
The regression parameters take much longer to converge, so the network above uses the classification task to compute the connection weights of the shared part of the network.

Approach 2: sliding image windows
• Still the same classification + regression idea
• Take "boxes" of different sizes
• Place the box at different positions and get a score for each placement
• Take the box with the highest score

Black box in the upper left corner: score 0.5

Black box in the upper right corner: score 0.75

Black box in the lower left corner: score 0.6

Black box in the lower right corner: score 0.8

Based on the scores, we choose the black box in the lower right corner as the prediction of the object's position.
Note: sometimes the two highest-scoring boxes are selected, and the intersection of the two boxes is taken as the final position prediction.

Question: how big should the box be?
Take boxes of different sizes and scan each one from the upper left corner to the lower right corner. Very brute-force.

To sum up the idea:
For one picture, crop it with boxes of various sizes (traversing the whole picture), feed each crop to the CNN, and the CNN outputs the score of that box (classification) together with the x, y, h, w of the box for that crop (regression).
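The brute-force scan summarized above can be sketched as follows. The CNN is replaced by a hypothetical `score_fn` callback (here simply the mean brightness of the crop), so this only illustrates the enumeration, not a real detector.

```python
import numpy as np

def sliding_windows(img_h, img_w, sizes, stride):
    """Yield (x, y, w, h) for every window of every size that fits in the image."""
    for (w, h) in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

def best_window(image, sizes, stride, score_fn):
    """Score every crop and keep the box with the highest score."""
    best_box, best_score = None, -np.inf
    for (x, y, w, h) in sliding_windows(image.shape[0], image.shape[1], sizes, stride):
        score = score_fn(image[y:y + h, x:x + w])
        if score > best_score:
            best_box, best_score = (x, y, w, h), score
    return best_box, best_score

# Toy image: a bright 4x4 "object" in the lower right corner of an 8x8 picture.
image = np.zeros((8, 8))
image[4:8, 4:8] = 1.0
box, score = best_window(image, sizes=[(4, 4)], stride=2, score_fn=np.mean)
```

Even this toy case evaluates 9 crops for a single box size; with several sizes, a small stride and a real CNN per crop, the cost grows quickly, which is the problem the next paragraphs address.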

This method is too time-consuming, so let's optimize it.
The original network looks like this:

Optimized like this: replace the fully connected layers with convolutional layers, which speeds things up.
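Why does this help? A fully connected layer applied to a flattened feature map is the same computation as a convolution whose kernel covers the whole map, so on a larger input the convolutional form yields a whole grid of scores in one pass instead of one forward pass per crop. The sketch below checks this equivalence numerically; the sizes (a 5x5x8 feature map, 10 outputs) and the helper `conv2d_valid` are arbitrary choices for illustration.

```python
import numpy as np

def conv2d_valid(fmap, kernels):
    """Naive 'valid' convolution. fmap: (H, W, C); kernels: (K, kh, kw, C)."""
    H, W, C = fmap.shape
    K, kh, kw, _ = kernels.shape
    out = np.empty((H - kh + 1, W - kw + 1, K))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = fmap[i:i + kh, j:j + kw, :]
            # Dot product of every kernel with the current patch.
            out[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
fc_weights = rng.normal(size=(10, 5 * 5 * 8))   # FC layer on a flattened 5x5x8 map
kernels = fc_weights.reshape(10, 5, 5, 8)       # the same weights viewed as 5x5 kernels

small = rng.normal(size=(5, 5, 8))
fc_out = fc_weights @ small.reshape(-1)         # ordinary fully connected output
conv_out = conv2d_valid(small, kernels)         # shape (1, 1, 10), same numbers

large = rng.normal(size=(7, 7, 8))
score_map = conv2d_valid(large, kernels)        # shape (3, 3, 10): one score vector per window
```

Each cell of `score_map` equals what the fully connected layer would output for the corresponding 5x5 window, but neighbouring windows share their computation.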

### 2.2 Object detection

What do we do when the image contains many objects? The difficulty increases sharply.

The task becomes: recognize multiple objects + localize multiple objects.
So should we treat this task as a classification problem?

What is wrong with treating it as classification?
• You need to look at a great many positions, with a great many boxes of different sizes
• You also need to classify the image inside every box
• Of course, if your GPU is powerful enough, well, go ahead and do it…

To summarize, the main problems of traditional object detection are:
1) The sliding-window region selection strategy is untargeted: high time complexity and redundant windows
2) Hand-designed features are not very robust to the diversity of variations

If we treat it as classification, is there any way to optimize it? I don't want to try so many boxes at so many positions!

## 3. Candidate region/window + deep learning classification

### 3.1 R-CNN bursts onto the scene

Someone came up with a good approach: find in advance the positions in the image where objects are likely to appear, i.e. the candidate regions (Region Proposals). By exploiting texture, edge and color information in the image, a high recall can be maintained while selecting far fewer windows (a few thousand or even a few hundred).

So the problem becomes finding the regions/boxes that might contain objects (that is, the candidate regions/boxes, for example 2000 candidate boxes). These boxes may overlap and contain one another, and this way we avoid brute-force enumeration of all possible boxes.

Researchers have invented many Region Proposal methods for picking candidate boxes, such as Selective Search and EdgeBoxes. How does the "selective search" algorithm pick these candidate boxes? For the details, see the PAMI 2015 paper "What makes for effective detection proposals?"

The following is a performance comparison of the various candidate box selection methods.

With candidate regions in hand, the remaining work is really just image classification on the candidate regions (feature extraction + classification). Speaking of image classification, one has to mention the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where machine learning pioneer Professor Geoffrey Hinton and his student Krizhevsky used a convolutional neural network to bring the top-5 error on the ILSVRC classification task down to 15.3%, while the runner-up, using traditional methods, had a top-5 error as high as 26.2%. Since then, convolutional neural networks (CNNs) have held absolute dominance in image classification.

In 2014, RBG (Ross B. Girshick) used Region Proposal + CNN in place of the sliding window + hand-designed features of traditional object detection and designed the R-CNN framework, which made a major breakthrough in object detection and set off the wave of deep-learning-based object detection.

The brief steps of R-CNN are as follows:
(1) Input a test image
(2) Use the Selective Search algorithm to extract, bottom-up, about 2000 candidate regions (Region Proposals) that may contain objects
(3) Because the extracted regions differ in size, each Region Proposal must be warped to a uniform 227x227 size and fed into the CNN; the output of the CNN's fc7 layer is taken as the feature
(4) Feed the CNN feature extracted from each Region Proposal into an SVM for classification

The detailed steps are as follows:
Step 1: Train (or download) a classification model (such as AlexNet)

Step 2: Fine-tune the model
• Change the number of classes from 1000 to 21, i.e. 20 object classes + 1 background class
• Remove the last fully connected layer

Step 3: Feature extraction
• Extract all candidate boxes of the image (Selective Search)
• For each region: resize the region to fit the CNN input, run one forward pass, and save the output of the fifth pooling layer (the features extracted for that candidate box) to disk

Step 4: Train an SVM classifier (binary) to decide the class of the object in a candidate box
There is one SVM per class, which decides whether the box belongs to that class: positive if it does, negative otherwise.

For example, the figure below shows the SVM for the dog class

Step 5: Use regression to fine-tune the candidate box position: for each class, train a linear regression model that judges how well the box frames the object and corrects it.
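Step 5 can be made concrete with the box parameterization commonly used for R-CNN style bounding box regression: offsets of the center are normalized by the proposal size, and width and height are regressed in log space. The sketch below assumes boxes are given as (center x, center y, width, height); the function names are illustrative only.

```python
import math

def bbox_regression_targets(proposal, ground_truth):
    """Targets (tx, ty, tw, th) the regressor should predict for one proposal.

    Boxes are ASSUMED to be (cx, cy, w, h): center coordinates plus width and height.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def apply_deltas(proposal, deltas):
    """Inverse transform: move and rescale the proposal by the predicted deltas."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))

proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 48.0, 30.0, 36.0)
targets = bbox_regression_targets(proposal, ground_truth)
refined = apply_deltas(proposal, targets)   # recovers the ground-truth box
```

Normalizing by the proposal size keeps the targets on a similar scale for small and large boxes, which makes the linear regressor easier to train.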

Attentive readers may have spotted the problem: although R-CNN no longer enumerates exhaustively like the traditional methods, in the first step of the R-CNN pipeline Selective Search extracts as many as about 2000 region proposals from the original image, and each of these 2000 candidate boxes has to go through CNN feature extraction + SVM classification. The amount of computation is huge, which makes R-CNN detection very slow: 47s per image.

Is there a way to speed this up? Yes. Aren't these 2000 region proposals all parts of the same image? Then we can extract convolutional features from the whole image once and simply map each region proposal's position in the original image onto the convolutional feature map. In this way, convolutional features are extracted only once per image, and the convolutional features of each region proposal are then fed into the fully connected layers for the subsequent operations.
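Mapping a proposal onto the feature map mostly comes down to dividing its coordinates by the total stride of the convolutional layers. The sketch below assumes a stride of 16 and simple floor/ceil rounding; the published methods use their own rounding rules, so treat `project_to_feature_map` as an illustration rather than the exact procedure.

```python
import math

def project_to_feature_map(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cells.

    Returns (fx1, fy1, fx2, fy2) with fx2/fy2 exclusive, covering at least one cell.
    The stride of 16 is an ASSUMPTION for this sketch.
    """
    x1, y1, x2, y2 = box
    fx1 = math.floor(x1 / stride)
    fy1 = math.floor(y1 / stride)
    fx2 = max(fx1 + 1, math.ceil(x2 / stride))
    fy2 = max(fy1 + 1, math.ceil(y2 / stride))
    return fx1, fy1, fx2, fy2

cells = project_to_feature_map((32, 48, 160, 200))
```

All 2000 proposals can be projected this way onto a single feature map, so the expensive convolutions run once per image instead of once per proposal.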

But the problem now is that region proposals come at different scales, while the input to a fully connected layer must have a fixed length, so feeding them directly into the fully connected layers will not work. SPP Net happens to solve exactly this problem.

### 3.2 SPP Net

SPP: Spatial Pyramid Pooling

SPP-Net comes from the 2015 IEEE paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition".

As is well known, a CNN generally consists of a convolutional part and a fully connected part. The convolutional layers do not require a fixed image size, whereas the fully connected layers require fixed-size input.

So when the fully connected layers face input data of varying sizes, the data must first be cropped (crop: cut a patch of the network's input size, e.g. 227×227, out of a large picture) or warped (warp: resize the content of a bounding box to 227×227), or go through similar operations that unify the picture size, for example 224*224 (ImageNet), 32*32 (LeNet), 96*96, etc.

That is why, as you saw above, in R-CNN "because the extracted regions differ in size, each Region Proposal must be warped to a uniform 227x227 size and fed into the CNN".

But preprocessing such as warp/crop means the object is either stretched and deformed or left incomplete, which limits recognition accuracy. Not quite clear? In plain words: if you insist on resizing a 16:9 picture into a 1:1 picture, wouldn't you say the picture gets distorted?

Kaiming He and the other authors of SPP Net thought about it the other way round: it is because of the fully connected (FC) layers that an ordinary CNN has to fix the size of the input image in order to fix the input to the fully connected layers. Since the convolutional layers can adapt to any size, why not add some structure after the last convolutional layer so that the input the fully connected layers receive becomes fixed?

The structure that "turns the mundane into magic" is the spatial pyramid pooling layer. The figure below compares the detection pipelines of R-CNN and SPP Net:

It has two features:
1. It combines the spatial pyramid method to give CNNs multi-scale input.
The first contribution of SPP Net is to attach a pyramid pooling layer after the last convolutional layer, which guarantees that the input to the next fully connected layer is fixed.
In other words, in an ordinary CNN architecture the size of the input image is usually fixed (for example 224*224 pixels) and the output is a vector of fixed dimension. SPP Net adds an ROI pooling layer (here, the spatial pyramid pooling layer) to the ordinary CNN structure, so that the input image can be of any size while the output stays the same: a vector of fixed dimension.

In short, a CNN originally took fixed input and produced fixed output; with SPP added, it takes arbitrary input and produces fixed output. Amazing, isn't it?

The ROI pooling layer generally follows the convolutional layers, so the network input can be of any scale. In the SPP layer, every pooling filter adjusts its size according to the input, and the SPP output is a vector of fixed dimension, which is then passed to the fully connected (FC) layers.

2. Convolutional features are extracted from the original image only once
In R-CNN, each candidate box is first resized to a uniform size and then fed separately into the CNN, which is very inefficient.
SPP Net optimizes exactly this shortcoming: it runs the convolution over the original image only once to obtain the feature map of the whole image, then finds the patch that each candidate box maps to on the feature map, and feeds this patch, as the convolutional feature of that candidate box, into the SPP layer and the layers after it to complete feature extraction.

Thus R-CNN has to compute convolutions for every region, while SPP Net computes them only once, which saves a great deal of computation time: it is reported to be up to about a hundred times faster than R-CNN.
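The "arbitrary input, fixed output" property described above is easy to verify with a small NumPy sketch of spatial pyramid pooling. The pyramid levels (1x1, 2x2, 4x4) and the 8-channel feature maps are assumptions chosen for brevity, not the configuration from the paper.

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (H, W, C) feature map into n x n bins per level and concatenate.

    The bin edges adapt to the input size, so the output length is always
    C * sum(n * n for n in levels), whatever H and W are.
    """
    H, W, C = fmap.shape
    pooled = []
    for n in levels:
        h_edges = np.linspace(0, H, n + 1)
        w_edges = np.linspace(0, W, n + 1)
        for i in range(n):
            r0, r1 = int(np.floor(h_edges[i])), int(np.ceil(h_edges[i + 1]))
            for j in range(n):
                c0, c1 = int(np.floor(w_edges[j])), int(np.ceil(w_edges[j + 1]))
                pooled.append(fmap[r0:r1, c0:c1, :].max(axis=(0, 1)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
fmap_a = rng.normal(size=(13, 13, 8))   # feature map from one image size
fmap_b = rng.normal(size=(10, 17, 8))   # feature map from a different image size
vec_a = spatial_pyramid_pool(fmap_a)
vec_b = spatial_pyramid_pool(fmap_b)    # same length as vec_a: 8 * (1 + 4 + 16) = 168
```

Both vectors have 168 entries even though the feature maps differ in size, which is exactly what the fully connected layers need.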

### 3.3 Fast R-CNN

SPP Net really is a good method. Fast R-CNN, the advanced version of R-CNN, adopts the SPP Net approach on top of R-CNN and improves R-CNN so that performance is further increased.

What are the differences between R-CNN and Fast R-CNN?
First, the shortcomings of R-CNN: even with preprocessing steps such as Selective Search extracting potential bounding boxes as input, R-CNN still has a serious speed bottleneck. The reason is obvious: when the computer extracts features for all the regions, a lot of computation is repeated. Fast R-CNN was born precisely to solve this problem.

Compared with the R-CNN framework diagram, there are two main differences: first, an ROI pooling layer is added after the last convolutional layer; second, the loss function is a multi-task loss, which brings bounding box regression directly into the CNN for training (for what bounding box regression is, see question 56 in the deep learning category of the July Online APP question bank: https://www.julyedu.com/question/big/kp_id/26/ques_id/2139).

(1) The ROI pooling layer is actually a simplified version of SPP-NET. SPP-NET uses pyramid grids of several sizes for each proposal, while the ROI pooling layer only downsamples to a single 7x7 feature map. For the VGG16 network, conv5_3 has 512 feature maps, so every region proposal corresponds to a 7*7*512-dimensional feature vector that serves as the input to the fully connected layers.

In other words, this layer maps inputs of different sizes to a feature vector of fixed scale. As we know, conv, pooling, relu and similar operations do not need fixed-size input, so after these operations are applied to the original image, the feature maps differ in size because the input images differ in size, and they cannot be connected directly to a fully connected layer for classification. But with this magical ROI pooling layer added, a fixed-dimension feature representation is extracted for each region, and category recognition is then done with an ordinary softmax.

(2) The R-CNN training process is split into three stages, whereas Fast R-CNN directly replaces the SVM classifier with softmax and uses the multi-task loss to bring bounding box regression into the network as well, so the whole training process is end-to-end (apart from the Region Proposal extraction stage).

That is, the earlier R-CNN pipeline first extracts proposals, then extracts features with the CNN, then classifies with SVMs, and finally does bbox regression. In Fast R-CNN, the author cleverly puts bbox regression inside the neural network and merges it with region classification into a multi-task model. Experiments show that the two tasks can share convolutional features and reinforce each other.
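The ROI pooling layer from point (1) can be sketched as a single-level version of pyramid pooling: whatever the size of the region on the feature map, it is max-pooled into a 7x7 grid. The sketch assumes the ROI is already expressed in feature-map cells; `roi_pool` is an illustrative name, not a library call.

```python
import numpy as np

def roi_pool(fmap, roi, out_size=7):
    """Max-pool one region of a (H, W, C) feature map to (out_size, out_size, C).

    roi = (x1, y1, x2, y2) in feature-map cells, with x2/y2 exclusive (ASSUMED format).
    """
    x1, y1, x2, y2 = roi
    region = fmap[y1:y2, x1:x2, :]
    h, w, C = region.shape
    out = np.empty((out_size, out_size, C), dtype=fmap.dtype)
    h_edges = np.linspace(0, h, out_size + 1)
    w_edges = np.linspace(0, w, out_size + 1)
    for i in range(out_size):
        r0, r1 = int(np.floor(h_edges[i])), int(np.ceil(h_edges[i + 1]))
        for j in range(out_size):
            c0, c1 = int(np.floor(w_edges[j])), int(np.ceil(w_edges[j + 1]))
            out[i, j] = region[r0:r1, c0:c1, :].max(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
fmap = rng.normal(size=(40, 60, 512))       # stand-in for a conv5_3-like feature map
big = roi_pool(fmap, (3, 5, 20, 30))        # 17 x 25 cells  -> 7 x 7 x 512
small = roi_pool(fmap, (0, 0, 3, 2))        # 3 x 2 cells    -> 7 x 7 x 512
```

Flattening either result gives the 7*7*512 = 25088-dimensional vector mentioned above, regardless of how large the proposal was.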

Therefore, one of Fast R-CNN's most important contributions is that it showed people the promise of the Region Proposal + CNN framework for real-time detection: multi-class detection really can speed up processing while preserving accuracy. It also laid the groundwork for the later Faster R-CNN.

Key points:
R-CNN has some considerable shortcomings (fix all of them and you get Fast R-CNN).
Big shortcoming: every candidate box has to go through the CNN separately, which takes a lot of time.
Solution: share the convolutional layers. Instead of feeding every candidate box into the CNN as input, feed in one complete picture and obtain the features of each candidate box at the fifth convolutional layer.

Original approach: many candidate boxes (say two thousand) --> CNN --> features of each candidate box --> classification + regression
Current approach: one complete picture --> CNN --> features of each candidate box --> classification + regression

So it is easy to see why Fast R-CNN is faster than R-CNN: unlike R-CNN, which pushes every candidate region through the deep network to extract features, it extracts features from the whole picture once and then maps the candidate boxes onto conv5. As in SPP, the features only need to be computed once, and everything else operates on the conv5 layer.

The performance improvement is also quite significant:

### 3.4 Faster R-CNN

The problem with Fast R-CNN: there is still a bottleneck, namely selective search; finding all the candidate boxes is also very time-consuming. Can we find a more efficient way to obtain these candidate boxes?

Solution: add a neural network that proposes the candidate boxes, i.e. leave the job of finding candidate boxes to a neural network as well.

So the Faster R-CNN authors (RBG among them) introduced the Region Proposal Network (RPN) into Fast R-CNN to replace Selective Search, and at the same time introduced anchor boxes to cope with changes in object shape (an anchor is a box with fixed position and size; it can be understood as a proposal fixed in advance).

How it works:
• Place the RPN after the last convolutional layer
• The RPN is trained directly to produce candidate regions

RPN in brief:
• Slide a window over the feature map
• Build a neural network for object classification + box position regression
• The position of the sliding window gives the approximate location of the object
• The box regression gives a more precise box position
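To make "a proposal fixed in advance" concrete, the sketch below generates anchors the way the Faster R-CNN paper describes them: 3 scales times 3 aspect ratios at every feature-map position. The stride of 16, the placement of centers at cell midpoints, and the function name `generate_anchors` are assumptions of this sketch.

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (fmap_h * fmap_w * 9, 4) anchors as (x1, y1, x2, y2) in image pixels."""
    shapes = []
    for s in scales:
        for r in ratios:                      # r = height / width, area stays s * s
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))
    shapes = np.array(shapes)                 # (9, 2) as (width, height)

    # One anchor center per feature-map cell, expressed in image coordinates.
    cx = (np.arange(fmap_w) + 0.5) * stride
    cy = (np.arange(fmap_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)      # (N, 2)

    half = shapes / 2.0
    top_left = centers[:, None, :] - half[None, :, :]          # (N, 9, 2)
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

anchors = generate_anchors(38, 50)            # e.g. a 38 x 50 feature map
```

The RPN then only has to say, for each of these fixed boxes, whether it contains an object and how it should be shifted and rescaled, instead of searching for boxes from scratch.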

One network, four loss functions:
• RPN classification (anchor good/bad)
• RPN regression (anchor -> proposal)
• Fast R-CNN classification (over classes)
• Fast R-CNN regression (proposal -> box)

Speed comparison

The main contribution of Faster R-CNN is the design of the RPN, a network for extracting candidate regions that replaces the time-consuming selective search and greatly improves detection speed.

Finally, a summary of the steps of each algorithm:
R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)
2. Scale the image patch in each candidate box to the same size and feed it into the CNN for feature extraction
3. For the features extracted from each candidate box, use a classifier to decide whether it belongs to a particular class
4. For candidate boxes belonging to some class, use a regressor to further adjust their position

Fast R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)
2. Feed the whole picture into the CNN to obtain the feature map
3. Find the patch each candidate box maps to on the feature map, and feed this patch, as the convolutional feature of that candidate box, into the SPP layer and the layers after it
4. For the features extracted from each candidate box, use a classifier to decide whether it belongs to a particular class
5. For candidate boxes belonging to some class, use a regressor to further adjust their position

Faster R-CNN
1. Feed the whole picture into the CNN to obtain the feature map
2. Feed the convolutional features into the RPN to obtain the feature information of the candidate boxes
3. For the features extracted from each candidate box, use a classifier to decide whether it belongs to a particular class
4. For candidate boxes belonging to some class, use a regressor to further adjust their position

• R-CNN (Selective Search + CNN + SVM)
• SPP-net (ROI Pooling)
• Fast R-CNN (Selective Search + CNN + ROI)
• Faster R-CNN (RPN + CNN + ROI)

Overall, from R-CNN through SPP-NET and Fast R-CNN to Faster R-CNN, the deep-learning-based object detection pipeline has become more and more streamlined, with higher and higher accuracy and faster and faster speed. It is fair to say that the Region Proposal-based R-CNN family is the most important branch of current object detection technology.

## 4. Regression methods based on deep learning

### 4.1 YOLO (CVPR2016, oral)

(You Only Look Once: Unified, Real-Time Object Detection)

Faster R-CNN is currently the mainstream object detection method, but its speed cannot meet real-time requirements. Methods like YOLO have gradually shown their importance: they use the idea of regression, take the whole picture as the network input, and directly regress, at multiple positions of the image, the bounding box at that position and the class of the object.

Let's go straight to the YOLO object detection flowchart:

(1) Given an input image, first divide the image into a 7*7 grid
(2) For each grid cell, predict 2 bounding boxes (including the confidence that each box is an object and the probabilities of each box region over the multiple classes)
(3) The previous step yields 7*7*2 predicted target windows; windows with low probability are then removed by thresholding, and finally NMS removes the redundant windows (for what non-maximum suppression (NMS) is, see question 58 in the deep learning category of the July Online APP question bank: https://www.julyedu.com/question/big/kp_id/26/ques_id/2141).
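The last part of step (3), removing redundant windows, can be sketched as the usual greedy non-maximum suppression: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The IoU threshold of 0.5 and the (x1, y1, x2, y2) box format are assumptions of this sketch.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); returns kept indices, best first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)               # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Overlap of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]    # drop boxes that overlap too much
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In the toy example the second box overlaps the first with IoU of about 0.68 and is suppressed, while the distant third box survives.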

As you can see, the whole process is very simple: there is no need for an intermediate Region Proposal step to look for objects; direct regression determines both position and class.

Summary: YOLO turns the object detection task into a regression problem, which greatly speeds up detection, so that YOLO can process 45 images per second. And because each prediction of a target window uses full-image information, the proportion of false positives is greatly reduced (ample context information).

But YOLO has its problems too: without a Region Proposal mechanism, regression on only a 7*7 grid cannot localize objects very precisely, which is why YOLO's detection accuracy is not very high.

### 4.2 SSD

(SSD: Single Shot MultiBox Detector)

The analysis above showed YOLO's problem: using whole-image features on a coarse 7*7 grid to regress object positions is not very precise. Could the Region Proposal idea be combined in to achieve more precise localization? SSD does this by combining YOLO's regression idea with Faster R-CNN's anchor mechanism.

The figure above is a framework diagram of SSD. First, SSD obtains object positions and classes the same way YOLO does, by regression; but whereas YOLO predicts a position using features of the whole image, SSD predicts a position using the features around that position (which feels more reasonable).

So how is the correspondence between a position and its features established? You may have guessed: with Faster R-CNN's anchor mechanism. As in the SSD framework diagram, suppose the feature map of some layer (figure b) is 8*8 in size; a 3*3 sliding window then extracts the features at each position, and regression on those features yields the coordinate and class information of the object (figure c).

Unlike Faster R-CNN, these anchors live on multiple feature maps, so multi-layer features can be exploited and multi-scale detection is achieved naturally (a 3*3 sliding window has a different receptive field on the feature maps of different layers).
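A quick way to see what "anchors on multiple feature maps" means in practice is to count the boxes. The layer sizes below follow the published SSD300 configuration (they are not given in this article, so treat them as an outside assumption), and `num_classes=21` assumes VOC's 20 classes plus background.

```python
# (feature map side length, default boxes per location), ASSUMED SSD300-style setup
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def count_default_boxes(layers):
    """Total number of default boxes (anchors) over all feature maps."""
    return sum(side * side * k for side, k in layers)

def predictor_output_shape(side, boxes_per_location, num_classes):
    """Output shape of the 3x3 conv predictor on one feature map:
    every default box gets num_classes scores plus 4 box offsets."""
    return (side, side, boxes_per_location * (num_classes + 4))

total_boxes = count_default_boxes(SSD300_LAYERS)
shape_8x8 = predictor_output_shape(8, 4, num_classes=21)
```

The large early feature maps contribute many small boxes and the small late ones contribute a few large boxes, which is how a single forward pass covers objects of very different sizes.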

Summary: SSD combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN, regressing on multi-scale regional features at every position of the whole image. It keeps YOLO's speed while making window predictions about as accurate as those of Faster R-CNN. SSD reaches 72.1% mAP on VOC2007 at 58 frames per second on a GPU.

## Main reference and extended reading

2 https://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=502841131&idx=1&sn=bb3e8e6aeee2ee1f4d3f22459062b814#rd
3 https://zhuanlan.zhihu.com/p/27546796
4 https://blog.csdn.net/v1_vivian/article/details/73275259
5 https://blog.csdn.net/tinyzhao/article/details/53717136
6 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, by Kaiming He et al.
7 https://zhuanlan.zhihu.com/p/24774302
8 The new book by Zhihu columnist He Zhiyuan, "21 Projects for Playing with Deep Learning: Practice Explained in Detail Based on TensorFlow"
9 YOLO: https://blog.csdn.net/tangwei2014/article/details/50915317, https://zhuanlan.zhihu.com/p/24916786

https://chowdera.com/2020/12/20201207132421095w.html