
[PointPillars Explained] Fast Encoders for Object Detection from Point Clouds

2021-09-15 04:07:03 AI bacteria


1. Introduction to PointPillars

This paper presents PointPillars, a novel method for 3D object detection that uses PointNets to learn a representation of point clouds organized in vertical columns (pillars). Although the encoded features can be combined with any standard 2D convolutional detection architecture, the paper further proposes a lean downstream network. Extensive experiments show that PointPillars substantially outperforms previous encoders in both speed and accuracy. Despite using only lidar, PointPillars significantly outperforms the state of the art on the KITTI 3D and bird's-eye-view (BEV) benchmarks.

Paper: https://arxiv.org/pdf/1812.05784.pdf

Code: https://github.com/nutonomy/second.pytorch


2. Related Work

Object detection in point clouds is intrinsically a three-dimensional problem, so deploying a 3D convolutional network for detection is the natural choice. In the most common paradigm, the point cloud is organized into voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature encoding to form a pseudo-image that can be processed by a standard image detection architecture. Notable works here include MV3D, AVOD, PIXOR, and Complex-YOLO, which all use variants of the same fixed encoding paradigm as the first step of their architectures. The first two methods additionally fuse lidar features with image features to create a multimodal detector. The fusion step used in MV3D and AVOD forces them to use a two-stage detection pipeline, while PIXOR and Complex-YOLO use a single-stage pipeline.

Earlier, Qi et al. proposed PointNet, a simple architecture for learning from unordered point sets that offers a full end-to-end learning approach. VoxelNet was one of the first methods to deploy PointNets for object detection in lidar point clouds. In that approach, PointNets are applied to voxels, followed by a set of 3D convolutional layers, then a 2D backbone and a detection head. This enables end-to-end learning, but like earlier work that relies on 3D convolutions, VoxelNet is slow, requiring 225 ms of inference time (4.4 Hz) for a single point cloud.

Another recent method, Frustum PointNet, uses PointNets to segment and classify the point cloud inside the frustum generated by projecting an image detection into 3D. Compared with other fusion methods, Frustum PointNet achieved high benchmark performance, but its multi-stage design makes end-to-end learning impractical. More recently, SECOND made a series of improvements to VoxelNet, yielding stronger performance and a greatly increased speed of 20 Hz. However, it could not get rid of the expensive 3D convolutional layers.

The main contributions of this paper are:

  • A novel point cloud encoder and network architecture, PointPillars, that operates on point clouds and enables end-to-end training of a 3D object detection network.
  • A demonstration of how all computations on pillars can be cast as dense 2D convolutions, enabling inference at 62 Hz, 2-4 times faster than other methods.
  • Experiments on the KITTI dataset showing state-of-the-art results for cars, pedestrians, and cyclists on both the BEV and 3D benchmarks.
  • Several ablation studies examining the key factors behind the strong detection performance.

3. PointPillars Network Architecture

PointPillars takes a point cloud as input and estimates oriented 3D boxes for cars, pedestrians, and cyclists. It consists of three main stages: (1) a feature encoder network that converts the point cloud into a sparse pseudo-image; (2) a 2D convolutional backbone that processes the pseudo-image into a high-level representation; and (3) a detection head that detects and regresses 3D boxes.


(1) Point Cloud to Pseudo-Image

To apply a 2D convolutional architecture, the point cloud is first converted into a pseudo-image. This conversion proceeds in the following steps:

1) Input point cloud

The point cloud is discretized into an evenly spaced grid in the x-y plane (the z axis is ignored). All points that fall into the same grid cell are considered to belong to one pillar; in other words, they constitute a pillar.

Each point is then represented by a D = 9 dimensional vector $(x, y, z, r, x_c, y_c, z_c, x_p, y_p)$. Here $x, y, z, r$ are the point's true 3D coordinates and reflectance; $x_c, y_c, z_c$ are the offsets of the point from the arithmetic mean (geometric center) of all points in its pillar; and $x_p, y_p$ are the offsets of the point from the pillar's x-y center, reflecting the point's position relative to its pillar.
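As a concrete illustration, the 9-dimensional decoration of one pillar's points can be sketched in NumPy as follows (the function name and array layout are my own, not taken from the official implementation):

```python
import numpy as np

def decorate_pillar(points, pillar_center_xy):
    """Lift raw lidar points (x, y, z, r) in one pillar to the D=9
    encoding (x, y, z, r, x_c, y_c, z_c, x_p, y_p).

    points: (N, 4) array; pillar_center_xy: (2,) x/y center of the grid cell.
    """
    mean = points[:, :3].mean(axis=0)             # arithmetic mean of the pillar's points
    to_mean = points[:, :3] - mean                # x_c, y_c, z_c: offsets from the mean
    to_center = points[:, :2] - pillar_center_xy  # x_p, y_p: offsets from the cell center
    return np.hstack([points, to_mean, to_center])
```
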

2) Stacking pillars

Suppose each sample contains P non-empty pillars and each pillar contains N points; the sample can then be represented by a (D, P, N) tensor. But how do we guarantee that each pillar contains exactly N points?

If a pillar contains more than N points, we randomly sample it down to N; if it contains fewer than N, we pad the missing entries with zeros. This makes it straightforward to convert a point cloud into a (D, P, N) tensor of stacked pillars.
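A minimal sketch of the sample-or-pad step, assuming a pillar's points arrive as an (n, D) array (the helper name is hypothetical):

```python
import numpy as np

def fit_to_n(points, n, rng=None):
    """Sample or zero-pad one pillar's points so it has exactly n rows."""
    rng = rng or np.random.default_rng(0)
    if len(points) > n:
        keep = rng.choice(len(points), size=n, replace=False)  # random subsample
        return points[keep]
    pad = np.zeros((n - len(points), points.shape[1]))         # zero padding
    return np.vstack([points, pad])
```
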

3) Feature learning

Given the stacked pillars, the author uses a simplified PointNet to process the tensorized point cloud data and extract features. Feature extraction can be understood as transforming the per-point dimension: the raw dimension D = 9 is lifted to C, yielding a (C, P, N) tensor.

Next, a max-pooling operation over the points within each pillar produces a feature map of shape (C, P).
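The simplified PointNet step can be sketched as a shared linear layer plus max pooling (BatchNorm is omitted here, and the weight shapes are illustrative, not from the official code):

```python
import numpy as np

def pillar_feature_net(pillars, weight, bias):
    """Simplified PointNet: a shared linear layer (a 1x1 convolution in
    practice) lifts each point from D to C channels, followed by ReLU and
    max pooling over the N points of each pillar.

    pillars: (D, P, N); weight: (C, D); bias: (C,). Returns (C, P).
    """
    lifted = np.einsum('cd,dpn->cpn', weight, pillars) + bias[:, None, None]
    lifted = np.maximum(lifted, 0.0)  # ReLU
    return lifted.max(axis=2)         # max over the N points in each pillar
```
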

4) Pseudo-image

To obtain pseudo-image features, the author scatters the P pillars back to their original (W, H) grid locations, finally producing a pseudo-image of shape (C, H, W).
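The scatter-back step can be sketched as follows (the (row, col) index convention is assumed):

```python
import numpy as np

def scatter_to_canvas(features, coords, h, w):
    """Scatter (C, P) pillar features back to their grid cells to form a
    (C, H, W) pseudo-image; cells with no pillar stay zero.

    coords: (P, 2) integer (row, col) grid indices of each pillar.
    """
    c, p = features.shape
    canvas = np.zeros((c, h, w), dtype=features.dtype)
    canvas[:, coords[:, 0], coords[:, 1]] = features
    return canvas
```
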

(2) Backbone

The paper uses a backbone network similar to that of VoxelNet. The backbone consists of two sub-networks: a top-down network that produces features at increasingly small spatial resolution, and a second network that upsamples the top-down features and concatenates them.

The top-down backbone can be characterized by a series of blocks Block(S, L, F). Each block operates at stride S (measured relative to the original pseudo-image). A block has L 3x3 2D convolutional layers with F output channels, each followed by BatchNorm and a ReLU. The first convolution inside a block has stride $S / S_{in}$ to ensure the block operates at stride S after receiving input of stride $S_{in}$; all subsequent convolutions in the block have stride 1.

The final features from each top-down block are combined through upsampling and concatenation, as follows:

  • First, the features are upsampled with a transposed 2D convolution with F final channels, Up($S_{in}$, $S_{out}$, F), going from an initial stride $S_{in}$ to a final stride $S_{out}$ (both again measured relative to the original pseudo-image).
  • Next, BatchNorm and ReLU are applied to the upsampled features.
  • The final output features are the concatenation of all features originating from the different strides.
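To make the stride bookkeeping concrete, this small helper traces the (C, H, W) shapes through such a backbone; the parameter values in the usage note below loosely follow my reading of the car network in the paper (C = 64) and are otherwise illustrative:

```python
def backbone_shapes(input_hw, blocks, up_channels, out_stride):
    """Trace (C, H, W) shapes through a PointPillars-style 2D backbone.

    blocks: list of (S, L, F) tuples -- each block runs L 3x3 convolutions
    with F channels at stride S relative to the pseudo-image. Every block
    output is then upsampled by Up(S, out_stride, up_channels), and all
    upsampled maps are concatenated along the channel axis.
    """
    h, w = input_hw
    block_shapes = [(f, h // s, w // s) for s, _, f in blocks]
    concat = (up_channels * len(blocks), h // out_stride, w // out_stride)
    return block_shapes, concat
```

For example, with a 512x512 pseudo-image, blocks (2, 4, 64), (4, 6, 128), (8, 6, 256), and all three outputs upsampled to stride 2 with 128 channels each, the concatenated feature map has shape (384, 256, 256).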

(3) Detection Head

The paper uses the Single Shot Detector (SSD) setup for 3D object detection. As in SSD, prior boxes are matched to ground-truth boxes using 2D IoU. Bounding-box height and elevation are not used for matching; instead, given a 2D match, height and elevation become additional regression targets.

4. Loss Function

This paper uses the same loss functions as SECOND. Ground-truth boxes and anchors are defined by (x, y, z, w, l, h, θ). The localization regression residuals between ground truth and anchors are defined as:

$$\Delta x = \frac{x^{gt} - x^a}{d^a}, \quad \Delta y = \frac{y^{gt} - y^a}{d^a}, \quad \Delta z = \frac{z^{gt} - z^a}{h^a}$$
$$\Delta w = \log \frac{w^{gt}}{w^a}, \quad \Delta l = \log \frac{l^{gt}}{l^a}, \quad \Delta h = \log \frac{h^{gt}}{h^a}, \quad \Delta \theta = \sin(\theta^{gt} - \theta^a)$$
where $x^{gt}$ and $x^a$ denote the ground truth and the anchor box, respectively, and

$$d^a = \sqrt{(w^a)^2 + (l^a)^2}$$
The total localization loss is:
$$\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \text{SmoothL1}(\Delta b)$$
Because the angle localization loss cannot distinguish flipped boxes, the paper adds a softmax classification loss $\mathcal{L}_{dir}$ on discretized heading directions, which enables the network to learn heading. For object classification, the paper uses focal loss:
$$\mathcal{L}_{cls} = -\alpha_a (1 - p^a)^\gamma \log p^a$$
where $p^a$ is the class probability of an anchor. The settings of the original focal-loss paper, α = 0.25 and γ = 2, are used. The total loss is therefore:

$$\mathcal{L} = \frac{1}{N_{pos}} \left( \beta_{loc} \mathcal{L}_{loc} + \beta_{cls} \mathcal{L}_{cls} + \beta_{dir} \mathcal{L}_{dir} \right)$$

where $N_{pos}$ is the number of positive anchors, and $\beta_{loc} = 2$, $\beta_{cls} = 1$, $\beta_{dir} = 0.2$.
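In scalar form, the classification and total losses can be sketched as follows (the helper names are mine, and the focal loss is written for a positive anchor):

```python
import numpy as np

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal loss for a positive anchor with predicted class probability p."""
    return -alpha * (1.0 - p) ** gamma * np.log(p)

def total_loss(l_loc, l_cls, l_dir, n_pos,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.2):
    """Weighted sum of the three loss terms, normalized by positive anchors."""
    return (beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir) / n_pos
```

Note how the $(1 - p)^\gamma$ factor down-weights easy, well-classified anchors relative to hard ones.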

To optimize the loss function, the paper uses the Adam optimizer with an initial learning rate of $2 \times 10^{-4}$, decaying the learning rate by a factor of 0.8 every 15 epochs and training for 160 epochs. A batch size of 2 is used for the validation set and 4 for the test submission.
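The step-decay schedule amounts to the following (a sketch; real training code would attach this to the optimizer):

```python
def lr_at_epoch(epoch, lr0=2e-4, decay=0.8, step=15):
    """Step-decayed learning rate: multiply lr0 by `decay` every `step` epochs."""
    return lr0 * decay ** (epoch // step)
```
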

5. Experiments

(1) Dataset

All experiments in this paper use the KITTI object detection benchmark dataset, which consists of samples with both lidar point clouds and images. Training is performed only on lidar point clouds, but comparisons include fusion methods that use both lidar and images. The samples are originally divided into 7481 training samples and 7518 test samples. For the experimental studies, the official training set is split into 3712 training samples and 3769 validation samples; for the test submission, a minimal set of 784 samples is held out from the validation set and training is performed on the remaining 6733 samples.

The KITTI benchmark requires detecting cars, pedestrians, and cyclists. Because ground-truth objects are only annotated if they are visible in the image, the paper follows the standard convention of using only lidar points that project into the image. Following standard KITTI literature practice, one network is trained for cars and another for pedestrians and cyclists.

(2) Data Augmentation

Data augmentation is critical for good performance on the KITTI benchmark. First, following SECOND, a lookup table of all ground-truth 3D boxes, together with the point clouds that fall inside these boxes, is created for every class. Then, for each sample, 15, 0, and 8 ground-truth samples are randomly selected for cars, pedestrians, and cyclists respectively and placed into the current point cloud. These settings were found to perform better than the originally proposed ones.

Next, each ground-truth box is individually augmented: every box is rotated (drawn uniformly from [−π/20, π/20]) and translated (x, y, and z drawn independently from N(0, 0.25)) to further enrich the training set.

Finally, two sets of global augmentations are applied jointly to the point cloud and to all boxes. First, a random mirror flip along the x axis is applied, followed by a global rotation and scaling. Lastly, a global translation with x, y, z drawn from N(0, 0.2) is applied to simulate localization noise.
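A sketch of the global augmentation applied to an (N, ≥3) point array; the flip and the N(0, 0.2) translation follow the text, while the rotation and scaling ranges are assumptions for illustration (the same transform would also be applied to the boxes):

```python
import numpy as np

def global_augment(points, rng=None):
    """Random mirror flip over the x axis, global rotation and scaling,
    then a global translation drawn from N(0, 0.2) per axis."""
    rng = rng or np.random.default_rng(0)
    pts = points.copy()
    if rng.random() < 0.5:                       # mirror flip over the x axis
        pts[:, 1] = -pts[:, 1]
    theta = rng.uniform(-np.pi / 4, np.pi / 4)   # global rotation (assumed range)
    c, s = np.cos(theta), np.sin(theta)
    pts[:, :2] = pts[:, :2] @ np.array([[c, -s], [s, c]]).T
    pts[:, :3] = pts[:, :3] * rng.uniform(0.95, 1.05)       # scaling (assumed range)
    pts[:, :3] = pts[:, :3] + rng.normal(0.0, 0.2, size=3)  # localization noise
    return pts
```
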

(3) Comparison Experiments

As shown in Tables 1 and 2, PointPillars outperforms all published methods in mean average precision (mAP). Compared with lidar-only methods, PointPillars achieves better results across all classes and difficulty strata. It also outperforms fusion-based methods on cars and cyclists, with the exception of the easy car stratum.

Qualitative results are provided in Figures 3 and 4 of the paper. Although the network is trained only on lidar point clouds, for ease of interpretation the 3D bounding-box predictions are shown from both the BEV and image perspectives.

Although PointPillars predicts oriented 3D boxes, the BEV and 3D metrics do not take orientation into account. Orientation is evaluated using AOS, which requires projecting the 3D boxes into the image, performing 2D detection matching, and then evaluating the orientation of these matches. Compared with the only two other 3D detection methods that predict oriented boxes, PointPillars significantly outperforms them on AOS across all strata, as shown in Table 3:


6. Summary

In this paper, the authors introduce PointPillars, a novel deep network and encoder that can be trained end-to-end on lidar point clouds. Experiments demonstrate that on the KITTI challenge, PointPillars dominates all existing methods by offering higher detection performance (mAP on both BEV and 3D) at faster speed. The results show that PointPillars offers the best architecture so far for 3D object detection from lidar.

Copyright notice
This article was created by [AI bacteria]. Please include a link to the original when reposting.
https://chowdera.com/2021/09/20210909111002637S.html
