# Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

2020-12-08 11:34:47

# 1. Abstract

(Since there is no concrete use case for segmentation here, this post does not go into the segmentation model and task in detail.)

Some deep networks must optimize multiple regression and classification objectives at the same time. In practice, the weight assigned to each task's loss greatly affects overall performance, yet searching for these weights by hand is difficult and resource-intensive. This paper proposes learning the weights by considering the homoscedastic uncertainty of each task, which makes it possible to simultaneously learn objectives with different units and scales in both classification and regression settings. The resulting multi-task model is shown to outperform separate models trained independently on each task.

Contributions:

1. A novel multi-task loss that uses homoscedastic uncertainty to learn classification and regression losses of different scales and units simultaneously.
2. A unified architecture combining semantic segmentation, instance segmentation, and depth regression.
3. A demonstration that the loss weights matter for model performance, together with a method that finds good weights automatically.

# 2. Introduction

The goal of multi task learning is to improve learning efficiency and prediction accuracy by learning multiple objectives from a shared representation .

Previous approaches to learning multiple tasks simultaneously used a naive weighted sum of losses, with weights that are either uniform or tuned by hand. But we find that the performance of such models depends heavily on these weights. Manual tuning and grid search over the weights are expensive and do not solve the essential problem: the optimal weight of each task depends on its measurement scale (for example meters, centimeters, or millimeters) and ultimately on the magnitude of the task's noise.

The authors interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function that can learn to balance various regression and classification losses.

# 3. Multi-task learning

A brief expression of the conventional multi-task objective:

$$L_{\text{total}}=\sum_{i} w_{i} L_{i} \tag{1}$$

This is just a simple linear combination of the per-task losses, but it has several problems; in particular, the model is sensitive to the weight parameters $w_i$ (see Figure 2 of the paper).
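To make this sensitivity concrete, here is a minimal Python sketch of objective (1); the loss values and weights are illustrative, not taken from the paper:

```python
# Naive multi-task objective (1): a fixed weighted sum of per-task losses.
# The weights w_i are hand-picked hyperparameters.

def total_loss(task_losses, weights):
    """Eq. (1): L_total = sum_i w_i * L_i."""
    return sum(w * l for w, l in zip(weights, task_losses))

# Two tasks whose losses live on very different scales, e.g. a depth error
# in meters vs. a cross-entropy in nats (illustrative values):
losses = [0.04, 2.3]

uniform = total_loss(losses, [1.0, 1.0])   # task 2 dominates the gradient
tuned = total_loss(losses, [50.0, 1.0])    # hand-tuned weights rebalance it
```

Note that merely changing the measurement unit (meters to millimeters, say) rescales one loss and invalidates the hand-tuned weights, which is exactly the problem the paper's uncertainty weighting addresses.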

## 1. Homoscedastic uncertainty

In Bayesian modeling, there are two main types of uncertainty:

1. Epistemic uncertainty is uncertainty in the model: it captures what the model does not know due to a lack of training data, and it can be explained away with more training data.
2. Aleatoric uncertainty captures noise that the data cannot explain. It could only be explained away if we were able to observe all explanatory variables with increasing precision.

Aleatoric uncertainty can be divided into two kinds:

1. Data-dependent (heteroscedastic) uncertainty, which depends on the input data and is predicted as a model output.
2. Task-dependent (homoscedastic) uncertainty, which does not depend on the input data and is not a model output. It stays constant across all input data but varies between tasks, so it can be described as task-dependent uncertainty.

In multi-task learning, we can use homoscedastic uncertainty as the basis for weighting the losses.

## 2. Multi-task likelihoods

In this section, a multi-task loss function is derived by maximizing a Gaussian likelihood with homoscedastic uncertainty. Let $f^W(x)$ denote the output of a neural network with weights $W$ on input $x$. We define the following probabilistic model: for regression tasks, the likelihood is a Gaussian with mean given by the model output:

$$p\left(\mathbf{y} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=\mathcal{N}\left(\mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma^{2}\right) \tag{2}$$

with an observation noise scalar $\sigma$ that represents how much noise the output carries.

For classification, we often squash the model output through a softmax function and sample from the resulting probability vector:

$$p\left(\mathbf{y} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=\operatorname{Softmax}\left(\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \tag{3}$$
With multiple model outputs, the likelihood is often defined to factorize over the outputs, given some sufficient statistics. Defining $f^W(x)$ as our sufficient statistics gives the following multi-task likelihood:

$$p\left(\mathbf{y}_{1}, \ldots, \mathbf{y}_{K} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=p\left(\mathbf{y}_{1} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \cdots p\left(\mathbf{y}_{K} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \tag{4}$$

where $y_1,\ldots,y_K$ are the model outputs (for example, semantic segmentation and depth regression).

In maximum-likelihood inference, we maximize the log likelihood of the model.

For a regression task, for example, the log likelihood can be written as:

$$\log p\left(\mathbf{y} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \propto-\frac{1}{2 \sigma^{2}}\left\|\mathbf{y}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}-\log \sigma \tag{5}$$
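Eq. (5) follows directly from the Gaussian density in eq. (2). For a one-dimensional output (the multivariate case only changes the constants), taking the log and dropping the constant $-\frac{1}{2}\log 2\pi$, which depends on neither $W$ nor $\sigma$, gives the stated proportionality:

```latex
\log \mathcal{N}\left(\mathbf{y} ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma^{2}\right)
  = -\frac{1}{2 \sigma^{2}}\left\|\mathbf{y}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}
    - \log \sigma - \frac{1}{2} \log 2 \pi
```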

We then maximize the log likelihood with respect to the model parameters $W$ and the observation noise parameter $\sigma$.

Suppose the model output consists of two vectors $y_1$ and $y_2$ (two regression tasks), each following a Gaussian distribution:

$$\begin{aligned} p\left(\mathbf{y}_{1}, \mathbf{y}_{2} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) &=p\left(\mathbf{y}_{1} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \cdot p\left(\mathbf{y}_{2} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \\ &=\mathcal{N}\left(\mathbf{y}_{1} ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{1}^{2}\right) \cdot \mathcal{N}\left(\mathbf{y}_{2} ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{2}^{2}\right) \end{aligned} \tag{6}$$

This leads to the minimization objective $\mathcal{L}(W,\sigma_1,\sigma_2)$ of our multi-task model:

$$\begin{aligned} \mathcal{L}\left(\mathbf{W}, \sigma_{1}, \sigma_{2}\right) &=-\log p\left(\mathbf{y}_{1}, \mathbf{y}_{2} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)\\ &\propto \frac{1}{2 \sigma_{1}^{2}}\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}+\frac{1}{2 \sigma_{2}^{2}}\left\|\mathbf{y}_{2}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}+\log \sigma_{1} \sigma_{2}\\ &=\frac{1}{2 \sigma_{1}^{2}} \mathcal{L}_{1}(\mathbf{W})+\frac{1}{2 \sigma_{2}^{2}} \mathcal{L}_{2}(\mathbf{W})+\log \sigma_{1} \sigma_{2} \end{aligned} \tag{7}$$

where $\mathcal{L}_{1}(\mathbf{W})=\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}$ is the loss of the first output variable $y_1$, and $\mathcal{L}_{2}$ is defined analogously.

The parameters $\sigma_1,\sigma_2$ act as adaptive, data-driven weights of the losses $\mathcal L_1(W)$ and $\mathcal L_2(W)$: as $\sigma_1$ (the noise of variable $y_1$) increases, the weight of $\mathcal L_1(W)$ decreases; conversely, as the noise decreases, the corresponding loss weight increases. The last term acts as a regularizer on the noise, preventing it from increasing indefinitely (which would otherwise trivially shrink the first two terms).
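This trade-off can be sketched in a few lines of Python. The following is a minimal illustration of eq. (7) with scalar losses already computed; it is not the paper's implementation:

```python
import math

def multitask_loss(l1, l2, sigma1, sigma2):
    """Eq. (7): L1/(2*sigma1^2) + L2/(2*sigma2^2) + log(sigma1*sigma2)."""
    return (l1 / (2 * sigma1 ** 2)
            + l2 / (2 * sigma2 ** 2)
            + math.log(sigma1 * sigma2))

# With sigma1 = sigma2 = 1 the objective reduces to the plain sum of losses:
base = multitask_loss(1.0, 1.0, sigma1=1.0, sigma2=1.0)   # 0.5 + 0.5 + log 1

# Raising sigma1 down-weights L1, but the log term pushes back, so the
# noise cannot grow without bound:
noisy = multitask_loss(1.0, 1.0, sigma1=2.0, sigma2=1.0)  # 0.125 + 0.5 + log 2
```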

This construction extends straightforwardly to multiple regression outputs.

For classification, we scale the model output before the softmax to obtain the classification likelihood:

$$p\left(\mathbf{y} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma\right)=\operatorname{Softmax}\left(\frac{1}{\sigma^{2}} \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \tag{8}$$

where $\sigma$ is a positive scalar. This can be interpreted as a Boltzmann (Gibbs) distribution whose input is scaled by $\sigma^2$. The scale can be fixed or learned, and its magnitude determines how "uniform" (flat) the discrete distribution is. The log likelihood of this output can be written as:

$$\log p\left(\mathbf{y}=c \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma\right) =\frac{1}{\sigma^{2}} f_{c}^{\mathbf{W}}(\mathbf{x}) -\log \sum_{c^{\prime}} \exp \left(\frac{1}{\sigma^{2}} f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right) \tag{9}$$
where $f_c^W(x)$ is the $c$-th element of the vector $f^W(x)$.
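The effect of the scale on the flatness of the distribution is easy to see numerically. Below is a small sketch of the scaled softmax in eq. (8), written without any deep-learning framework; the logit values are made up for illustration:

```python
import math

def scaled_softmax(logits, sigma):
    """Eq. (8): softmax of the logits scaled by 1/sigma^2."""
    scaled = [z / sigma ** 2 for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = scaled_softmax([2.0, 0.0], sigma=1.0)  # closer to one-hot
flat = scaled_softmax([2.0, 0.0], sigma=3.0)   # closer to uniform
```

A larger $\sigma^2$ flattens the distribution toward uniform (raising its entropy), while $\sigma = 1$ recovers the ordinary softmax of eq. (3).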

Now suppose the model has one continuous output $y_1$ and one discrete output $y_2$, modeled with a Gaussian likelihood and a softmax likelihood respectively. The joint loss $\mathcal L(W,\sigma_1,\sigma_2)$ is:

$$\begin{aligned} \mathcal{L}\left(\mathbf{W}, \sigma_{1}, \sigma_{2}\right) &=-\log p\left(\mathbf{y}_{1}, \mathbf{y}_{2}=c \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \\ &=-\log \mathcal{N}\left(\mathbf{y}_{1} ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{1}^{2}\right) \cdot \operatorname{Softmax}\left(\mathbf{y}_{2}=c ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{2}\right) \\ &=\frac{1}{2 \sigma_{1}^{2}}\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}+\log \sigma_{1}-\log p\left(\mathbf{y}_{2}=c \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{2}\right) \\ &=\frac{1}{2 \sigma_{1}^{2}} \mathcal{L}_{1}(\mathbf{W})+\frac{1}{\sigma_{2}^{2}} \mathcal{L}_{2}(\mathbf{W})+\log \sigma_{1} +\log \frac{\sum_{c^{\prime}} \exp \left(\frac{1}{\sigma_{2}^{2}} f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)}{\left(\sum_{c^{\prime}} \exp \left(f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)\right)^{\frac{1}{\sigma_{2}^{2}}}} \\ &\approx \frac{1}{2 \sigma_{1}^{2}} \mathcal{L}_{1}(\mathbf{W})+\frac{1}{\sigma_{2}^{2}} \mathcal{L}_{2}(\mathbf{W})+\log \sigma_{1}+\log \sigma_{2} \end{aligned} \tag{10}$$

where $\mathcal{L}_{1}(\mathbf{W})=\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}$ is the Euclidean loss of $y_1$, and $\mathcal L_2(W)=-\log \operatorname{Softmax}(y_2, f^W(x))$ is the cross-entropy loss of $y_2$ (with unscaled $f^W(x)$). We optimize with respect to $W$, $\sigma_1$, and $\sigma_2$. The final $\approx$ uses the simplifying assumption $\frac{1}{\sigma_{2}} \sum_{c^{\prime}} \exp \left(\frac{1}{\sigma_{2}^{2}} f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right) \approx\left(\sum_{c^{\prime}} \exp \left(f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)\right)^{\frac{1}{\sigma_{2}^{2}}}$, which becomes an equality when $\sigma_2 \to 1$.

This final objective can be seen as learning the relative weight of each output loss. The larger the scale $\sigma_2$, the smaller the contribution of the loss $\mathcal L_2(W)$; the scale is regularized by the $\log \sigma_2$ term, which penalizes the objective when the scale grows too large.

In this way the objective can combine arbitrary discrete and continuous loss functions. The loss is smoothly differentiable and well-formed, so the task weights cannot converge to zero, unlike objective $(1)$, where a directly learned weight could quickly drive a loss term to zero.

In the experiments, the network is trained to predict the log variance $s := \log \sigma^2$, because regressing this variable is more numerically stable than regressing $\sigma^2$ directly, and the exponential mapping avoids any division by zero in the loss.
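A minimal sketch of this parametrization, assuming scalar task losses that have already been computed (the function names are illustrative, not from the paper). With $s = \log\sigma^2$, the regression weight $\frac{1}{2\sigma^2}$ becomes $\frac{1}{2}e^{-s}$ and $\log\sigma = \frac{s}{2}$, so each term of eq. (10) can be written as:

```python
import math

def regression_term(loss, s):
    """Regression term of eq. (10) with s = log(sigma^2):
    L/(2*sigma^2) + log(sigma) = 0.5*exp(-s)*L + 0.5*s."""
    return 0.5 * math.exp(-s) * loss + 0.5 * s

def classification_term(loss, s):
    """Classification term of eq. (10) with s = log(sigma^2):
    L/sigma^2 + log(sigma) = exp(-s)*L + 0.5*s."""
    return math.exp(-s) * loss + 0.5 * s

# s = 0 corresponds to sigma = 1, i.e. the unweighted losses:
print(regression_term(2.0, s=0.0))      # -> 1.0
print(classification_term(2.0, s=0.0))  # -> 2.0
```

In a framework such as PyTorch, $s$ would simply be one learnable scalar parameter per task, updated by the same optimizer as the network weights; since $e^{-s}$ is always positive, no division by zero can occur.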

# 4. Conclusion

The paper shows that correctly weighting the loss terms is critical for multi-task learning, and that homoscedastic (task) uncertainty provides an effective basis for loss weighting.

https://chowdera.com/2020/12/20201208113425241m.html