# FGAGT: Flow-Guided Adaptive Graph Tracking

2020-11-08 08:04:31

# 1. Abstract

This article presents the FGAGT tracker. Unlike FairMOT, it uses the Lucas-Kanade pyramid optical flow method to predict the positions of historical targets in the current frame, and extracts feature vectors for both the historical and the newly detected objects from the current frame via ROI Pooling and a fully connected layer. These feature vectors are then fed into an adaptive graph neural network, which updates them by combining global historical position and appearance information. Because historical information is preserved, the tracker can re-identify occluded objects. For the training phase, a balanced MSE loss is proposed to balance the sample distribution; in the inference phase, data association is performed with the Hungarian algorithm. The core idea of FGAGT is that reliable matching must use both position information and appearance features to obtain good tracking results.
Contributions:

1. Optical flow is used to predict the positions of historical objects in the current frame, and their features are extracted there. Unlike methods that extract historical features from the feature maps of past frames, this avoids drawing features from different feature maps, so the features of newly detected targets stay consistent with the extracted historical target features.
2. An adaptive graph network integrates spatio-temporal appearance and position information to update features and re-identify occluded objects (weighted feature learning).
3. A balanced MSE loss is proposed to balance the different sample distributions.

# 2. Introduction

Multi-object tracking must match the targets detected in the current frame against the targets' historical trajectories: the same target must be assigned the same id and extend its trajectory, a new target must be assigned a new id, and a vanished target's trajectory must be deleted so it can no longer be matched.

Problem: although many methods use optical flow to predict position information, during feature matching the features of historical targets and of newly detected targets come from different feature maps, so the feature vectors of the same target always differ, sometimes considerably. When such a target disappears due to occlusion and later re-enters the field of view, matching it to its historical trajectory becomes difficult. Moreover, a past target can match only one target in the current frame; when there is more than one long-term historical target, historical targets are likely to be (wrongly) matched to new ones, so very few targets end up identified as new.

Solution: the FGAGT tracker is proposed. The Lucas-Kanade pyramid algorithm computes the center position of each historical target in the current frame.

The current frame is then passed through a convolutional neural network to obtain a feature map, and ROI Pooling with a fully connected layer extracts the initial appearance feature vectors of the historical and the newly detected targets (the targets in the current frame). These are fed into the adaptive graph network: historical targets and new detections are treated as a bipartite graph, where the distance between initial appearance features and the IoU serve as the edge weights. Combining global spatio-temporal position and appearance information, the graph network updates the feature vectors.
Inside the network, each dimension of the aggregated features is multiplied by an adaptive weight, so the network concentrates its learning on the feature dimensions that best resolve occlusion and re-identification. The final output of the graph network is a similarity matrix. For the training phase, a balanced MSE loss is proposed: the loss terms for continuing positives, continuing negatives, new targets, and vanishing targets are each multiplied by a coefficient negatively correlated with their sample count. This mitigates the imbalanced sample distribution, so that the graph network learns the desired feature vectors.

## 2.1 Optical flow

Optical flow is the per-pixel motion from one frame to the next, represented as a two-dimensional vector. Optical flow estimation can be divided into sparse and dense. Sparse optical flow selects pixels with salient features (large gradients) for estimation and tracking. Dense optical flow describes the motion of every pixel in the image to the next frame, typically visualized with color as the flow direction and brightness as the flow magnitude.

The LK (Lucas-Kanade) algorithm makes three assumptions: 1. brightness constancy, 2. small target motion, 3. consistent optical flow within a neighborhood.
If all assumptions hold, the optical flow differential equation for a selected pixel from the current frame to the next can be constructed.
Consider a pixel in the first frame with brightness intensity $I(x,\ y,\ t)$, where $t$ is the time dimension and $(x, y)$ the position. By the next frame, this pixel has moved a distance $(dx,\ dy)$ over a time $dt$. Since it is the same pixel, and by the first assumption its brightness is unchanged before and after the motion:
$$I(x,\,y,\,t)=I(x+dx,\,y+dy,\,t+dt)\tag{1}$$
Expanding the right-hand side as a Taylor series:
$$I(x,\,y,\,t)=I(x,\,y,\,t)+\frac{\partial I}{\partial x}dx+\frac{\partial I}{\partial y}dy+\frac{\partial I}{\partial t}dt+\varepsilon\tag{2}$$
$\varepsilon$ is a second-order infinitesimal and can be neglected. Substituting (2) into (1) and dividing both sides by $dt$ gives:
$$\frac{\partial I}{\partial x}\frac{dx}{dt}+\frac{\partial I}{\partial y}\frac{dy}{dt}+\frac{\partial I}{\partial t}\frac{dt}{dt}\approx 0\tag{3}$$
Let $u$ and $v$ be the velocity components along the $x$ and $y$ axes: $$u=\frac{dx}{dt},\quad v=\frac{dy}{dt}\tag{4}$$
With $I_x=\frac{\partial I}{\partial x}$, $I_y=\frac{\partial I}{\partial y}$, $I_t=\frac{\partial I}{\partial t}$ denoting the partial derivatives of the pixel gray value along the $x$, $y$, and $t$ directions, (3) becomes: $$I_x u+I_y v+I_t\approx 0\tag{5}$$
Here $I_x,\ I_y, I_t$ can all be obtained by finite differences (a numerical method for solving definite problems of partial or ordinary differential equations), and $(u,\ v)$ is the optical flow vector to be found. From (5), an objective function over a neighborhood $\Omega$ is formed:
$$\min \sum_{(x,y)\in\Omega} W^{2}(X)\left(I_{x} u+I_{y} v+I_{t}\right)^{2}\tag{6}$$
In the above, $W^2(X)$ is a weight function: the farther a pixel is from the neighborhood center, the smaller its weight. Let
$$V=\left(\frac{dx}{dt},\frac{dy}{dt}\right)=(u,v)^T,\qquad \nabla I(X)=\left(\frac{\partial I}{\partial x},\frac{\partial I}{\partial y}\right)=\left(I_x,I_y\right)^T$$
$$A=\left(\nabla I\left(X_1\right),\ldots,\nabla I\left(X_n\right)\right)^T$$
$$W=\mathrm{diag}\left(W\left(X_1\right),\ldots,W\left(X_n\right)\right)$$
$$b=-\left(\frac{\partial I\left(X_1\right)}{\partial t},\ldots,\frac{\partial I\left(X_n\right)}{\partial t}\right)^T$$
where $n$ is the number of pixels in the neighborhood. The least-squares solution of (6) is then:

$$V=\left(A^{T} W^{2} A\right)^{-1} A^{T} W^{2} b\tag{7}$$
Finally, the Newton-Raphson method can be applied to refine the estimate.
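The weighted least-squares solve in (7) is easy to verify numerically. Below is a minimal NumPy sketch (not the paper's code): gradients come from finite differences, a Gaussian window plays the role of $W(X)$, and the flow is recovered on a synthetic image pair with a known sub-pixel translation. The function name `lk_flow` and all parameter values are illustrative.

```python
import numpy as np

def lk_flow(I1, I2, center, win=7, sigma=2.0):
    """Estimate optical flow (u, v) at `center` via the weighted
    least-squares solution V = (A^T W^2 A)^-1 A^T W^2 b of Eq. (7)."""
    # Spatial gradients I_x, I_y and temporal gradient I_t by finite differences.
    Iy, Ix = np.gradient(I1)            # np.gradient returns d/drow, d/dcol
    It = I2 - I1
    r, c = center
    h = win // 2
    ys, xs = np.mgrid[r - h:r + h + 1, c - h:c + h + 1]
    # Rows of A are the spatial gradients inside the window; b = -I_t.
    A = np.stack([Ix[ys, xs].ravel(), Iy[ys, xs].ravel()], axis=1)
    b = -It[ys, xs].ravel()
    # Gaussian weights: pixels far from the window center count less.
    w = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2)).ravel()
    W2 = np.diag(w ** 2)
    return np.linalg.inv(A.T @ W2 @ A) @ A.T @ W2 @ b   # (u, v)

# Synthetic test: a smooth image translated by a known sub-pixel flow.
y, x = np.mgrid[0:64, 0:64].astype(float)
u_true, v_true = 0.3, 0.2
I1 = np.sin(0.3 * x + 0.1 * y) + np.cos(0.2 * y - 0.15 * x)
I2 = (np.sin(0.3 * (x - u_true) + 0.1 * (y - v_true))
      + np.cos(0.2 * (y - v_true) - 0.15 * (x - u_true)))

u, v = lk_flow(I1, I2, center=(32, 32))
print(u, v)   # close to (0.3, 0.2)
```

The estimate is only approximate because the brightness-constancy equation is a first-order linearization; this is exactly the error the Newton-Raphson refinement mentioned above would reduce.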

### The pyramid LK algorithm

Because the LK algorithm requires the displacement to be sufficiently small, the Pyramid-LK algorithm addresses this problem:
The image is shrunk layer by layer to form an image pyramid, each level having half the width and height of the level below. This converts a large displacement into a small one. The LK algorithm then computes optical flow starting from the top level, correcting the flow estimate at each layer on the way down, until the bottom level yields the final optical flow.
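A toy sketch of this coarse-to-fine scheme (not the paper's implementation): a brute-force integer SSD search stands in for the per-level LK step, and the displacement estimate is doubled and corrected at each finer level. The shift (6, 4) is too large for the ±2 search at any single level, but is recovered through the pyramid.

```python
import numpy as np

def downsample(img):
    """Halve width and height by averaging 2x2 blocks."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def shift_image(img, dy, dx):
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def best_integer_shift(I1, I2, rng=2):
    """Brute-force integer displacement minimizing SSD (stand-in for one LK step)."""
    best, best_d = np.inf, (0, 0)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            e = np.sum((shift_image(I1, dy, dx) - I2) ** 2)
            if e < best:
                best, best_d = e, (dy, dx)
    return best_d

def pyramid_flow(I1, I2, levels=3):
    """Coarse-to-fine: estimate at the top level, double and correct going down."""
    p1, p2 = [I1], [I2]
    for _ in range(levels - 1):
        p1.append(downsample(p1[-1]))
        p2.append(downsample(p2[-1]))
    dy = dx = 0
    for lvl in range(levels - 1, -1, -1):
        # Warp with the propagated estimate, then search for the residual shift.
        warped = shift_image(p1[lvl], dy, dx)
        cy, cx = best_integer_shift(warped, p2[lvl])
        dy, dx = dy + cy, dx + cx
        if lvl > 0:
            dy, dx = 2 * dy, 2 * dx   # displacements double at each finer level
    return dy, dx

# A 64x64 random pattern translated (with wraparound) by (6, 4).
rng = np.random.default_rng(0)
I1 = rng.random((64, 64))
I2 = shift_image(I1, 6, 4)
print(pyramid_flow(I1, I2))   # (6, 4)
```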

*Figure: schematic of the pyramid LK algorithm*

## 2.2 Graph neural networks (GNN)

Early graph networks were mostly used for molecular structure classification. In fact, common scenarios in Euclidean space (e.g. images) or sequences (e.g. text) can also be converted into graphs and then processed with graph neural network techniques.
In 2013, Bruna proposed graph convolutional neural networks in both the spectral and spatial domains based on graph signal processing, and since then GNNs have begun to demonstrate their power.
For a graph $G$, every node $v$ has its own feature vector $X_v$. Each edge connects two nodes, and for connected nodes $v,u$ the edge also has its own feature vector $X_{(v,u)}$. The goal of a GNN is to obtain a graph-aware hidden state $h_v$ for each node; the hidden state of a node contains information gathered from its neighboring nodes. The GNN is realized by iteratively updating the hidden states of all nodes. At layer $c+1$, node $v$ is updated as: $$\mathbf{h}_{v}^{c+1}=f\left(\mathbf{x}_{v}, \mathbf{x}_{co}[v], \mathbf{h}_{ne}^{c}[v], \mathbf{x}_{ne}[v]\right)\tag{8}$$
Here $f$ is the hidden-state update function, which we implement with a neural network; $\mathbf{x}_{co}[v]$ denotes the feature vectors of the edges adjacent to node $v$; $\mathbf{x}_{ne}[v]$ denotes the feature vectors of $v$'s neighboring nodes; and $\mathbf{h}_{ne}^c[v]$ denotes the hidden states of $v$'s neighbors at layer $c$. $f$ is a global function shared by all nodes.
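A minimal sketch of the update rule (8) on a toy graph, assuming a single shared dense layer with tanh as the update function $f$ and mean aggregation over neighbors; all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: a 4-node cycle, with a feature vector on every edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
d_node, d_edge, d_hid = 8, 4, 16
x = rng.standard_normal((4, d_node))                   # node features x_v
x_e = {e: rng.standard_normal(d_edge) for e in edges}  # edge features x_(v,u)
h = np.zeros((4, d_hid))                               # hidden states h_v

# Shared update function f (Eq. 8): input is [x_v ; mean(h_u || x_e || x_u)].
W = rng.standard_normal((d_node + d_hid + d_edge + d_node, d_hid)) * 0.1

def neighbors(v):
    for (a, b) in edges:
        if a == v: yield b, (a, b)
        if b == v: yield a, (a, b)

def gnn_layer(x, h):
    """One iteration: every node aggregates (h_u, x_edge, x_u) over neighbors."""
    h_new = np.zeros_like(h)
    for v in range(len(x)):
        msgs = [np.concatenate([h[u], x_e[e], x[u]]) for u, e in neighbors(v)]
        agg = np.mean(msgs, axis=0)                    # permutation-invariant
        h_new[v] = np.tanh(np.concatenate([x[v], agg]) @ W)
    return h_new

for _ in range(3):                                     # 3 rounds of updates
    h = gnn_layer(x, h)
print(h.shape)  # (4, 16)
```

After a few iterations, each node's hidden state mixes in information from progressively larger neighborhoods, which is what Eq. (8) formalizes.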

# 3. FGAGT Network Structure

The structure diagram is as follows:

As shown in the figure, the current frame $I_t$ passes through the backbone to become a feature map. The $M$ bboxes detected in the current frame are used to extract regional features, which are turned into feature vectors via ROI Pooling and a fully connected layer. Meanwhile, the historical bboxes of frame $I_{t-1}$ are predicted into the current frame by the LK pyramid algorithm; adding the bboxes from the past $T$ frames (excluding $I_{t-1}$) gives $N$ historical bboxes in total, which are likewise converted into feature vectors via ROI Pooling and the fully connected layer.
The input to the graph neural network thus has two parts: the appearance feature vectors and bboxes of the $M$ objects in the current frame, and the appearance feature vectors and bboxes of the historical objects from the past $T$ frames.
Note that all appearance feature vectors are extracted from the feature map of the current frame, not from the feature maps of the historical frames. The final output is a similarity matrix $S\in R^{M\times N}$.

## 3.1 Adaptive graph neural network model

The newly detected targets and the historical targets are treated as a bipartite graph, as pictured below (this matches the description above: the historical and current detection information is refined through iterative updates):

As shown on the left of the figure above, every newly detected target is connected to all historical targets, but no edge connects two new detections or two historical targets. To obtain the hidden state $h_v$ of each node, iterative updates are performed according to (8):
$$\begin{aligned} {}^{(c+1)}h_{d}^{i}&=f\left({}^{(c)}h_{d}^{i},\left\{{}^{(c)}h_{t}^{j},{}^{(c)}e_{d}^{i,j}\right\}_{j=1}^{N}\right), & i=1,2,\cdots,M \\ {}^{(c+1)}h_{t}^{j}&=f\left({}^{(c)}h_{t}^{j},\left\{{}^{(c)}h_{d}^{i},{}^{(c)}e_{t}^{j,i}\right\}_{i=1}^{M}\right), & j=1,2,\cdots,N \end{aligned}$$
$f$ is the hidden-state update function, implemented with a neural network. In this paper we use a single-layer GNN with an added adaptive part.

AGNN (adaptive graph neural network) uses the available position and feature priors as the edge weights $e_{i,j}$ to aggregate the target feature vectors and update all target features.
The aggregation proceeds as follows:

1. First compute the initial feature similarity:
$$\begin{aligned} s_{i,j}&=\frac{1}{\left\|f_{d}^{i}-f_{t}^{j}\right\|_{2}^{2}+1\times 10^{-16}} \\ s_{i,j}&=\frac{s_{i,j}}{\sqrt{s_{i,1}^{2}+s_{i,2}^{2}+\cdots+s_{i,j}^{2}+\cdots+s_{i,N}^{2}}} \\ \mathbf{S}_{\mathrm{ft}}&=\left[s_{i,j}\right]_{M\times N},\quad i=1,2,\cdots,M,\; j=1,2,\cdots,N \end{aligned}$$
2. Compute the intersection-over-union of the bboxes and combine it with the feature similarity into a prior matrix:
$$\mathbf{E}=w\times\mathrm{IOU}+(1-w)\times\mathbf{S}_{\mathrm{ft}}$$
The parameter $w$ balances the contribution of position information against appearance feature information and is initialized to 0.5.
3. Aggregate the information:
$$\mathbf{F}_{\mathrm{t}}^{\mathrm{ag}}=\mathbf{E}\mathbf{F}_{t}=\mathbf{E}\left[\begin{array}{c} f_{t}^{1} \\ f_{t}^{2} \\ \vdots \\ f_{t}^{N} \end{array}\right]$$
Then update the features:
$$\mathbf{H}_{\mathrm{d}}=\sigma\left(\mathbf{F}_{d} W_{1}+\operatorname{Sigmoid}\left(\mathbf{F}_{d} W_{a}\right) \odot \mathbf{F}_{\mathrm{t}}^{\mathrm{ag}} W_{2}\right)\tag{9}$$
and symmetrically:
$$\mathbf{H}_{\mathrm{t}}=\sigma\left(\mathbf{F}_{t} W_{1}+\operatorname{Sigmoid}\left(\mathbf{F}_{t} W_{a}\right) \odot \mathbf{F}_{\mathrm{d}}^{\mathrm{ag}} W_{2}\right)\tag{10}$$
Here $\odot$ is the element-wise product. The values of a target's feature vector in different dimensions capture features of different parts of the object, and particular parts may be the key to distinguishing objects. The aggregated information therefore needs dimension-wise weighting: an adaptive parameter $W_a$ is introduced, and $\mathrm{Sigmoid}(\mathbf{F}_d W_a)$ serves as the adaptive weights.
After the feature update, a fully connected layer reduces the dimensionality of the graph network's updated features, and the Euclidean distance measures the similarity between features. A simple single-layer graph network suffices: the output feature vectors are normalized, and the similarity matrix is obtained by matrix multiplication:
$$\begin{aligned} h_{d}^{i} &=\frac{h_{d}^{i}}{\left\|h_{d}^{i}\right\|_{2}} \\ h_{t}^{j} &=\frac{h_{t}^{j}}{\left\|h_{t}^{j}\right\|_{2}} \\ \mathbf{S}_{\text{out}} &=\mathbf{H}_{\mathrm{d}} \mathbf{H}_{\mathrm{t}}^{\mathrm{T}} \end{aligned}\tag{11}$$
The target output is 1 for the same target and 0 for different targets. The aim is to make the feature vectors of the same target nearly coincide and those of different targets nearly orthogonal. When the feature vectors are unnormalized with all elements greater than 0, this is equivalent to making the Euclidean distance between features of the same target as small as possible, and that between different targets as large as possible.
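Steps 1-3 and the updates (9)-(11) can be sketched in NumPy as follows. This is an illustration, not the paper's code: $W_1$, $W_2$, $W_a$ would be learned, the IoU matrix is filled with placeholder values, and tanh is assumed for the unspecified activation $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, d = 3, 4, 8          # M detections, N historical targets, feature dim d
F_d = rng.standard_normal((M, d))       # current-frame detection features
F_t = rng.standard_normal((N, d))       # historical target features
IOU = rng.random((M, N))                # bbox IoU prior (placeholder values)
w = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Step 1: inverse-squared-distance similarity, row-normalized (S_ft).
D2 = ((F_d[:, None, :] - F_t[None, :, :]) ** 2).sum(-1)
S = 1.0 / (D2 + 1e-16)
S_ft = S / np.linalg.norm(S, axis=1, keepdims=True)

# Step 2: prior edge weights E mix position (IoU) and appearance (S_ft).
E = w * IOU + (1 - w) * S_ft

# Step 3: aggregate features across the bipartite graph.
F_t_ag = E @ F_t            # (M, d): history aggregated for each detection
F_d_ag = E.T @ F_d          # (N, d): detections aggregated for each track

# Adaptive updates, Eqs. (9)-(10); W1, W2, Wa are learned in the real model.
W1, W2, Wa = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
H_d = np.tanh(F_d @ W1 + sigmoid(F_d @ Wa) * (F_t_ag @ W2))
H_t = np.tanh(F_t @ W1 + sigmoid(F_t @ Wa) * (F_d_ag @ W2))

# Eq. (11): L2-normalize rows; similarity matrix by matrix product.
H_d /= np.linalg.norm(H_d, axis=1, keepdims=True)
H_t /= np.linalg.norm(H_t, axis=1, keepdims=True)
S_out = H_d @ H_t.T
print(S_out.shape)  # (3, 4)
```

Because the rows of `H_d` and `H_t` are unit vectors, every entry of `S_out` is a cosine similarity in [-1, 1], which is what makes the 0/1 training target of Eq. (11) meaningful.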

## 3.2 Balanced MSE Loss

Among all targets, new and disappearing targets are a minority, so their share of the loss is correspondingly small. Moreover, each target has at most one positive sample (label = 1) while all the rest are negative (label = 0), so the sampling is extremely unbalanced. Each loss term is therefore multiplied by a coefficient to achieve a balancing effect.
$$\begin{aligned} \mathcal{L} &=\alpha E_{c0}+\beta E_{c1}+\gamma E_{ne}+\delta E_{d}+\varepsilon E_{w} \\ &=\sum_{i=1}^{M} \sum_{j=1}^{N}\Big[\alpha\left(\hat{S}_{i,j}-S_{i,j}\right)^{2} \cdot \mathbb{I}_{\text{continue}} \cdot \mathbb{I}_{S_{i,j}=0}+\beta\left(\hat{S}_{i,j}-S_{i,j}\right)^{2} \cdot \mathbb{I}_{\text{continue}} \cdot \mathbb{I}_{S_{i,j}=1} \\ &\qquad +\gamma\left(\hat{S}_{i,j}-S_{i,j}\right)^{2} \cdot \mathbb{I}_{new}+\delta\left(\hat{S}_{i,j}-S_{i,j}\right)^{2} \cdot \mathbb{I}_{\text{disap}}+\varepsilon\|W\|_{2}^{2}\Big] \end{aligned}\tag{12}$$
$$\mathbb{I}_{c}\left(S_{i,j}\right)=\left\{\begin{array}{ll} 1, & \text{if } S_{i,j} \text{ is the } c \text{ target} \\ 0, & \text{if } S_{i,j} \text{ isn't the } c \text{ target} \end{array}\right.$$
where $\alpha,\beta,\gamma,\delta,\varepsilon$ are hyperparameters.
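A sketch of Eq. (12) with indicator masks, assuming the weight-decay term $\varepsilon\|W\|_2^2$ is dropped and using illustrative coefficients (in the paper these are hyperparameters, chosen inversely to class frequency):

```python
import numpy as np

def balanced_mse_loss(S_hat, S, is_new, is_disap, coef):
    """Balanced MSE of Eq. (12): each class of sample gets its own
    coefficient (alpha, beta, gamma, delta); the weight-decay term
    epsilon * ||W||^2 is omitted in this sketch."""
    alpha, beta, gamma, delta = coef
    sq = (S_hat - S) ** 2
    cont = ~(is_new | is_disap)                 # continuing targets
    return (alpha * sq[cont & (S == 0)].sum()   # continuing negatives
            + beta * sq[cont & (S == 1)].sum()  # continuing positives
            + gamma * sq[is_new].sum()          # new targets
            + delta * sq[is_disap].sum())       # disappearing targets

# Toy 2x3 similarity matrix: detection 0 continues track 0,
# detection 1 is new, track 2 has disappeared.
S = np.array([[1., 0., 0.],
              [0., 0., 0.]])
S_hat = np.array([[0.9, 0.1, 0.2],
                  [0.3, 0.1, 0.0]])
is_new = np.array([[False, False, False],
                   [True,  True,  True ]])
is_disap = np.array([[False, False, True],
                     [False, False, False]])

# Rare classes (positives, new, disappearing) get larger coefficients.
loss = balanced_mse_loss(S_hat, S, is_new, is_disap, (1., 10., 5., 5.))
print(round(loss, 4))  # 0.81
```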

## 3.3 Inference

Take the similarity matrix $S_{out}$ produced by the adaptive graph network and append an $M\times M$ matrix on the right whose elements all equal $margin=\pi$, forming the augmented matrix $S_{out}=[S_{out},\ \pi\times 1_{M\times M}]$; then apply the Hungarian algorithm for matching.

### 3.3.1 Matching, association, and the appearance of new targets

Suppose the Hungarian algorithm outputs a pair $(i,j)$, where $i$ indexes a row and $j$ a column. If $j<N$, detection $i$ is matched to trajectory $j$, and $j$'s id is assigned to $i$; otherwise $i$ is a new target, and its id is set to $max\{id\}+1$.
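The association rule can be sketched as follows. A brute-force search over column permutations stands in for the Hungarian algorithm (in practice one would use an optimized solver), and the margin is set to an illustrative 0.3 rather than $\pi$ so that it is comparable to the cosine similarities; the track ids are made up:

```python
import numpy as np
from itertools import permutations

def assign(S_aug):
    """Maximize total similarity over the M x (N+M) augmented matrix;
    brute force here stands in for the Hungarian algorithm."""
    M, K = S_aug.shape
    best, best_cols = -np.inf, None
    for cols in permutations(range(K), M):
        score = sum(S_aug[i, c] for i, c in enumerate(cols))
        if score > best:
            best, best_cols = score, cols
    return best_cols

# 2 detections vs. 3 historical tracks.
S_out = np.array([[0.9, 0.2, 0.1],
                  [0.1, 0.15, 0.2]])
M, N = S_out.shape
margin = 0.3                      # illustrative threshold (the paper sets it = pi)
S_aug = np.hstack([S_out, margin * np.ones((M, M))])

track_ids = [7, 8, 9]             # hypothetical ids of the N historical tracks
next_id = max(track_ids) + 1
result = []
for i, j in enumerate(assign(S_aug)):
    if j < N:                     # matched: inherit the track's id
        result.append(track_ids[j])
    else:                         # dummy margin column: a new target
        result.append(next_id)
        next_id += 1
print(result)  # [7, 10]
```

Detection 0 matches track 0 (similarity 0.9 beats the margin), while detection 1's best real similarity (0.2) falls below the margin column, so it is declared a new target.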

### 3.3.2 Target disappearance

Suppose the number of past frames kept is 10. If a target fails to match in any of the next 10 frames, it is considered to have disappeared.

# 4. Summary

FGAGT reaches SOTA on MOTChallenge with a large improvement. This work studies how to extract better object features for better data association. It first uses optical flow to predict the positions of past-frame targets in the current frame and extracts their features there; these are fed, together with the features detected in the current frame, into the adaptive graph network, which updates the features by integrating global spatio-temporal position and appearance information. A balanced MSE loss is also proposed for training, which balances the loss weights across the different types of targets so that the neural network performs data association better.