
[YOLOF Interpretation] You Only Look One-level Feature (CVPR 2021)

2021-09-15 04:07:15 AI bacteria


This paper was published at CVPR 2021; the authors are from the Chinese Academy of Sciences, UCAS, and Megvii. Its main contribution is to point out the importance of the divide-and-conquer strategy in FPN, and on that basis to propose a simpler, more effective object detection framework, YOLOF.

1. YOLOF

The authors argue that the success of FPN (Feature Pyramid Network) comes not only from multi-scale feature fusion, but even more from its "divide and conquer" detection strategy. Based on this view, they propose a simple and efficient network architecture, YOLOF.

YOLOF uses only a single-level feature map for detection and introduces two key components, the Dilated Encoder and Uniform Matching, which bring considerable performance gains. Extensive experiments on the COCO benchmark show that YOLOF runs 13% faster than YOLOv4 with comparable accuracy.

Open-source code: https://github.com/megvii-model/YOLOF

2. FPN Overview

At present, FPN has become an almost indispensable component of high-performing object detection networks. It offers two main advantages:

  • Multi-scale feature fusion: fusing multiple low-resolution and high-resolution feature inputs to obtain better representations
  • Divide and conquer: detecting objects of different sizes on different feature levels according to their scale

The common belief about FPN is that its success rests on the fusion of multi-level features, which has spawned a line of research on manually designed, increasingly complex fusion methods. However, this view overlooks FPN's divide-and-conquer role. As a result, there has been little study of how each of these two benefits contributes to FPN's success, which in turn may have hindered new progress.

Therefore, this paper first studies the effect of FPN's two benefits in one-stage detectors, designing experiments that decouple multi-scale feature fusion and divide-and-conquer from RetinaNet.

As shown in the figure below, the authors compare MiMo, SiMo, MiSo, and SiSo encoders. Surprisingly, the SiMo encoder, which takes only the single input feature C5 and performs no feature fusion, achieves performance comparable to the MiMo encoder (i.e., FPN), with a gap of less than 1 mAP. In contrast, the performance of the MiSo and SiSo encoders drops sharply (by 12 mAP or more).

These results reveal two facts:

  • The C5 feature carries sufficient context for detecting objects at various scales, which is why the SiMo encoder can achieve comparable results
  • The benefit of multi-scale feature fusion is far less important than that of divide and conquer, so multi-scale feature fusion may not be FPN's most significant advantage


3. Related Work

(1) Multi-level feature detectors

Detecting objects with multiple features is a long-standing technique. Typical ways of constructing multiple features fall into image-pyramid methods and feature-pyramid methods. Image-pyramid-based detectors such as DPM dominated detection in the pre-deep-learning era; their advantage is that they can be applied out of the box and achieve strong performance.

However, image pyramids are not the only way to obtain multiple features; exploiting the feature pyramids inside a CNN is more efficient and more natural. SSD was the first to use multi-scale features, detecting objects of different sizes on each scale. FPN constructs a semantically rich feature pyramid by combining shallow and deep features. Later, PANet further enriched feature fusion on top of FPN to obtain better representations.

FPN has since become an almost essential component of modern detectors. It also fits popular single-stage detectors such as RetinaNet, FCOS, and their variants. Another way to obtain a feature pyramid is to use multiple branches with dilated convolutions. Unlike the work above, the method in this paper is a single-level feature detector.

(2) Single-level feature detectors

Early on, the R-CNN series and R-FCN extracted RoI features from only a single feature map, and their performance lagged behind the corresponding multi-level-feature detectors. Among one-stage detectors, YOLO and YOLOv2 use only the last output feature of the backbone; they detect quickly but suffer a drop in accuracy. CornerNet and CenterNet follow this route, detecting all objects on a single feature map with a downsampling rate of 4 while achieving competitive results.

Detecting on such high-resolution feature maps, however, incurs a huge memory cost and hinders deployment. Recently, DETR introduced Transformers into detection and showed that state-of-the-art results can be achieved using only the single C5 feature (32× downsampled). Because of its anchor-free mechanism and its Transformer learning stage, DETR requires long training to converge.

Unlike these papers, this work studies the working mechanism of multi-level detectors and, from an optimization perspective, provides an alternative to the widely used FPN. Moreover, YOLOF converges faster with comparable performance, so it can serve as a simple baseline for fast and accurate detectors.

4. How to Improve

The authors first try to replace the complex MiMo encoder with a simple SiSo encoder. But as the results in the figure below show, directly applying a SiSo encoder degrades detection performance drastically.

Careful analysis reveals that two problems introduced by the SiSo encoder are responsible for the performance drop:

  • First, the scale range matched by the receptive field of the C5 feature level is limited, which hampers detection of objects at different scales.
  • Second, the sparse anchors on a single-level feature cause an imbalance problem in positive anchors.

The following sections discuss these two problems and the solutions this paper provides.

(1) Dilated Encoder

As shown in Figure 4(a), the receptive field of the C5 feature can only cover a limited range of scales; objects whose scale falls outside this range are poorly detected. To detect all objects with a SiSo encoder, we must find a way to produce output features with diverse receptive fields, compensating for the lack of multi-level features.
The authors first enlarge the receptive field of C5 by stacking standard and dilated convolutions. Although the covered scale range expands, it still cannot cover all object scales, because dilation multiplies all of the originally covered scales by a factor greater than 1. This situation is illustrated in Figure 4(b): compared with Figure 4(a), the whole scale range shifts toward larger scales. The original and enlarged scale ranges are then combined by adding the corresponding features, yielding output features whose multiple receptive fields cover all object scales, as illustrated in Figure 4. This is easily done by building residual blocks with dilation in the intermediate 3×3 convolution layer.
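The receptive-field arithmetic behind this "multiply by a factor greater than 1" effect can be sketched with a small helper (an illustration, not code from the paper): for a stack of stride-1 convolutions, each layer with kernel size k and dilation d enlarges the receptive field by (k−1)·d, so larger dilation rates scale a layer's contribution instead of adding a fixed amount.

```python
def receptive_field(dilations, kernel_size=3):
    """Effective receptive field of a stack of stride-1 convolutions.

    Each layer with kernel size k and dilation d enlarges the receptive
    field by (k - 1) * d, so rf = 1 + sum((k - 1) * d_i).
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Four plain 3x3 convs (dilation 1): each layer adds only 2 to the RF.
print(receptive_field([1, 1, 1, 1]))  # 9
# The same depth with growing dilation rates covers a far larger range.
print(receptive_field([2, 4, 6, 8]))  # 41
```

The dilation rates (2, 4, 6, 8) here are purely illustrative; the point is that the same number of layers covers a much wider scale range, which is why the residual add is needed to keep the small-scale coverage as well.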

1) Dilated Encoder

Based on the above design, the authors propose a SiSo encoder named the Dilated Encoder, shown in Figure 5. It contains two main components: the Projector and the Residual Blocks. The projector first applies a 1×1 convolution to reduce the channel dimension, then adds a 3×3 convolution to refine the semantic context, the same as in FPN. After that, four consecutive dilated residual blocks with different dilation rates in their 3×3 convolution layers are stacked to generate output features whose multiple receptive fields cover objects of all scales.
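A minimal PyTorch sketch of this structure is below. The dilation rates (2, 4, 6, 8), the channel widths, and the omission of batch normalization are illustrative assumptions, not the official configuration; consult the open-source repository for the exact layers.

```python
import torch
import torch.nn as nn


class DilatedBottleneck(nn.Module):
    """Residual block sketch: 1x1 reduce -> 3x3 dilated -> 1x1 restore."""

    def __init__(self, channels=512, reduction=4, dilation=2):
        super().__init__()
        mid = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            # padding == dilation keeps the spatial size for a 3x3 kernel
            nn.Conv2d(mid, mid, kernel_size=3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # The residual add merges the original and the enlarged receptive field.
        return x + self.block(x)


class DilatedEncoderSketch(nn.Module):
    """Projector (1x1 + 3x3) followed by four dilated residual blocks."""

    def __init__(self, in_channels=2048, out_channels=512, dilations=(2, 4, 6, 8)):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )
        self.blocks = nn.Sequential(
            *[DilatedBottleneck(out_channels, dilation=d) for d in dilations]
        )

    def forward(self, c5):
        return self.blocks(self.projector(c5))


c5 = torch.randn(1, 2048, 20, 20)  # a C5-like feature map (32x downsampled)
out = DilatedEncoderSketch()(c5)
print(tuple(out.shape))  # (1, 512, 20, 20)
```

Note how the encoder keeps the spatial resolution of C5 and only reshapes the channel dimension, which is what allows detection to happen on this single feature level.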
2) Discussion

Dilated convolution is a common strategy for enlarging the feature receptive field in object detection. For example, TridentNet uses dilated convolutions to generate multi-scale features; it handles scale variation through a multi-branch structure and a weight-sharing mechanism, which differs from our single-level feature setting.

Moreover, the Dilated Encoder stacks its dilated residual blocks one after another without weight sharing. Although DetNet also applies dilated residual blocks successively, its purpose is to maintain the spatial resolution of features and preserve more detail in the backbone's output, whereas our goal is to generate features with multiple receptive fields outside the backbone. The design of the Dilated Encoder allows all objects to be detected on a single-level feature, rather than on multi-level features as in TridentNet and DetNet.

(2) Uniform Matching

How positive anchors are defined is crucial to the optimization problem in object detection. In anchor-based detectors, positive anchors are mainly determined by measuring the IoU between anchors and ground-truth boxes. In RetinaNet, an anchor is set as positive if its maximum IoU with the ground-truth boxes exceeds a threshold of 0.5; this is called Max-IoU matching.
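Max-IoU matching can be sketched in NumPy as follows (a simplified illustration: real implementations also assign ignore labels and handle images without ground truth):

```python
import numpy as np


def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and gt boxes (M, 4) in x1,y1,x2,y2."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)


def max_iou_matching(anchors, gt_boxes, pos_thresh=0.5):
    """RetinaNet-style rule: an anchor is positive if its best IoU > threshold."""
    best = iou_matrix(anchors, gt_boxes).max(axis=1)
    return best > pos_thresh  # boolean mask of positive anchors


anchors = np.array([[0, 0, 10, 10], [0, 0, 4, 4], [20, 20, 30, 30]], dtype=float)
gt = np.array([[0, 0, 10, 10]], dtype=float)
# Anchor 0 overlaps the box exactly; the others fall below the 0.5 threshold.
print(max_iou_matching(anchors, gt))
```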

In a MiMo encoder, anchors are pre-defined densely across multiple levels, and each ground-truth box generates positive anchors on the feature level corresponding to its scale. Given the divide-and-conquer mechanism, Max-IoU matching allows the ground-truth boxes at each scale to produce a sufficient number of positive anchors. With a SiSo encoder, however, the number of anchors drops dramatically compared with MiMo, from about 100k to 5k, resulting in sparse anchors. Applying Max-IoU matching to sparse anchors causes a matching problem, as shown in Figure 6: large ground-truth boxes induce far more positive anchors than small ones, leading to an imbalance in positive anchors. This imbalance makes the detector focus on large boxes while neglecting small ones.


1) Uniform Matching

To resolve this imbalance in positive anchors, we propose a uniform matching strategy: for each ground-truth box, take its k nearest anchors as positives. This ensures that all ground-truth boxes are matched with the same number of positive anchors regardless of their size (Figure 6). The balance among positive samples ensures that all ground-truth boxes take part in training and contribute equally. In addition, following Max-IoU matching, Uniform Matching sets IoU thresholds to ignore negative anchors with large IoU (>0.7) and positive anchors with small IoU (<0.15).
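The core of the strategy can be sketched in NumPy. Using L2 distance between anchor and box centers as the nearness measure and k=4 are illustrative assumptions here, and the IoU-threshold filtering described above is omitted:

```python
import numpy as np


def uniform_matching(anchor_centers, gt_centers, k=4):
    """For each ground-truth box, take its k nearest anchors (by center
    distance) as positives, so every box gets the same number of positive
    anchors regardless of its size."""
    # dists[i, j] = distance from anchor i to ground-truth box j
    dists = np.linalg.norm(
        anchor_centers[:, None, :] - gt_centers[None, :, :], axis=2
    )
    # indices of the k nearest anchors for each ground-truth column
    topk = np.argsort(dists, axis=0)[:k]          # shape (k, num_gt)
    pos = np.zeros(len(anchor_centers), dtype=bool)
    pos[topk.ravel()] = True
    return topk, pos


# A 4x4 grid of anchor centers and two boxes in different regions
xs, ys = np.meshgrid(np.arange(4) * 10.0, np.arange(4) * 10.0)
anchors = np.stack([xs.ravel(), ys.ravel()], axis=1)
gts = np.array([[0.0, 0.0], [25.0, 25.0]])
topk, pos = uniform_matching(anchors, gts, k=4)
print(topk.shape)  # (4, 2): each box is matched to exactly 4 anchors
```

Unlike Max-IoU matching, the number of positives per box is constant by construction, which is exactly the balance property the paragraph above describes.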

2) Discussion: relation to other matching methods

Applying top-k in the matching process is not new. ATSS first selects the top-k anchors for each ground-truth box on each of L feature levels, then samples positive anchors from the k × L candidates with a dynamic IoU threshold. However, ATSS focuses on adaptively defining positive and negative examples, while our uniform matching focuses on achieving balance among positive samples under sparse anchors. Although some earlier methods already achieve balance on positive samples, their matching processes were not designed for this imbalance. For example, YOLO and YOLOv2 match each ground-truth box to the best-matching cell or anchor, and DETR uses the Hungarian algorithm for matching; these can be viewed as top-1 matching, a special case of our uniform matching. Furthermore, uniform matching differs from learning-based matching: learning-to-match methods such as FreeAnchor and PAA adapt according to the learning status, whereas uniform matching is fixed and does not evolve during training. Uniform matching is designed specifically to resolve the positive-anchor imbalance problem under the SiSo setting.

5. YOLOF Network Architecture

Based on the improvements above, this paper presents YOLOF, a fast and straightforward framework with single-level features. Its architecture is divided into three parts: the backbone, the encoder, and the decoder, as shown in the figure below:
Backbone network

For all models, this paper adopts the ResNet and ResNeXt series as backbones, all pre-trained on ImageNet. The backbone outputs the C5 feature map, which has 2048 channels and a downsampling rate of 32. For fair comparison with other detectors, all BN layers in the backbone are frozen by default.

Encoder

For the encoder (Figure 5), following FPN, two projection layers (one 1×1 and one 3×3 convolution) are first appended to the backbone, producing a feature map with 512 channels. Then, to make the encoder's output features cover all object scales, residual blocks are added, each consisting of three consecutive convolutions: a 1×1 convolution reduces the number of channels by a factor of 4, a 3×3 convolution with dilation enlarges the receptive field, and a final 1×1 convolution restores the channel count.

Decoder

For the decoder, we adopt the main design of RetinaNet: two parallel task-specific heads, a classification head and a regression head, with only two small changes. First, following the FFN design in DETR, the two heads have different numbers of convolution layers: four convolutions on the regression head, each followed by batch normalization and ReLU, but only two on the classification head. Second, following AutoAssign, an implicit objectness prediction (without direct supervision) is added for each anchor on the regression head. The final classification score of each prediction is obtained by multiplying the classification output by the corresponding implicit objectness.
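The score combination at the end can be sketched as below. This is the naive multiplicative form implied by the description above; the actual implementation may combine the two logits in a numerically stabler way, so treat this as an illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def final_scores(cls_logits, objectness_logits):
    """Combine per-class scores with the per-anchor implicit objectness
    by multiplication (sketch of the decoder's final scoring step).

    cls_logits:        (num_anchors, num_classes)
    objectness_logits: (num_anchors, 1), broadcast over the classes
    """
    return sigmoid(cls_logits) * sigmoid(objectness_logits)


cls = np.array([[2.0, -1.0]])
obj = np.array([[0.0]])  # sigmoid(0) = 0.5, so it halves every class score
print(final_scores(cls, obj))
```

The objectness term lets the network suppress anchors on background without explicit supervision, since a low objectness pulls every class score down at once.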

6. Experiments

The authors benchmark YOLOF on MS COCO and compare it with RetinaNet and DETR, then provide detailed ablation studies with quantitative results and analysis for each component's design. Finally, to offer insight for further study of single-level detection, they provide an error analysis and show YOLOF's weaknesses compared with DETR.

(1) Experimental details

YOLOF is trained with synchronized SGD on 8 GPUs, with a total mini-batch of 64 images (8 images per GPU). All models are trained with an initial learning rate of 0.12. In addition, following DETR, the authors set a smaller learning rate for the backbone, namely 1/3 of the base learning rate. To stabilize early training, the number of warm-up iterations is extended from 500 to 1500. For the training schedule, given the increased batch size, the "1×" schedule in YOLOF is set to 22.5k iterations in total, with the learning rate divided by 10 at 15k and 20k iterations. Other schedules are adjusted according to the principles in Detectron2. For inference, results are post-processed with NMS at a threshold of 0.6. All other hyperparameters follow the RetinaNet settings.
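The step schedule described above can be sketched as a small function. The linear warmup shape is an assumption for illustration; the source only states the warmup length and the decay points.

```python
def learning_rate(step, base_lr=0.12, warmup_iters=1500,
                  decay_steps=(15000, 20000), decay_factor=0.1):
    """Sketch of the '1x' schedule described above: linear warmup over the
    first 1500 iterations, then the base rate of 0.12, divided by 10 at
    15k and again at 20k iterations (22.5k iterations total)."""
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters  # linear warmup (assumed)
    lr = base_lr
    for s in decay_steps:
        if step >= s:
            lr *= decay_factor
    return lr


print(round(learning_rate(16000), 6))  # 0.012 after the first decay
```

The 1/3 backbone rate mentioned above would simply be this value multiplied by 1/3 for the backbone parameter group.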

(2) Comparisons

Comparison with RetinaNet: with multi-scale testing, YOLOF reaches a final result of 47.1 mAP and a competitive 31.8 mAP on small objects.
Comparison with DETR: YOLOF outperforms DETR on small objects but lags behind it on large objects. More importantly, YOLOF converges much faster than DETR (about 7×), making it better suited than DETR as a simple baseline for single-level detectors.
Comparison with YOLOv4: YOLOF-DC5 runs 13% faster than YOLOv4 with an overall improvement of 0.8 mAP. YOLOF-DC5's results on small objects are worse than YOLOv4's (24.0 mAP vs. 26.7 mAP), but it performs much better on large objects (+7.1 mAP).

(3) Ablation experiments

The authors first analyze the two proposed components together, then present ablation experiments on the detailed design of each.

Effect of the Dilated Encoder and Uniform Matching with ResNet-50: together, the two components improve the plain single-level detector by 16.6 mAP.
The table above shows that both the Dilated Encoder and Uniform Matching are necessary for YOLOF and bring considerable improvements. Specifically, the Dilated Encoder has a significant impact on large objects (43.8 vs. 53.2), while the results on small and medium objects improve slightly.

This confirms that the limited scale range of the C5 feature (Section 4.1) is a serious problem, and the Dilated Encoder proposed here offers a simple yet effective solution. On the other hand, without Uniform Matching, performance on small and medium objects drops significantly (about 10 AP), while large objects are only slightly affected.

More ablation experiments are shown in the table below:

7. Summary

In short, the main contributions of this paper are:

  • It points out that FPN's most significant advantage is its divide-and-conquer solution to the optimization problem in dense object detection, rather than multi-scale feature fusion.
  • It presents YOLOF, a simple and effective baseline without FPN. In YOLOF, two key components, the Dilated Encoder and Uniform Matching, are proposed, closing the performance gap between SiSo and MiMo encoders.
  • Extensive experiments on the COCO benchmark demonstrate the importance of each component. In addition, YOLOF is compared with RetinaNet, DETR, and YOLOv4, and the experiments show that YOLOF can obtain comparable results faster on GPUs.

Copyright notice
This article was written by [AI bacteria]; please include the original link when reprinting:
https://chowdera.com/2021/09/20210909111002646p.html
