# Understanding Regularization via Linear Regression

2021-06-22 00:00:36

Given $$m$$ samples, each with $$n$$ features, stack them into the design matrix:

$$x_{m \times n} = (x_{1}, x_{2},...,x_{i},...,x_{m})^{T}$$

with training set

$$T=((x_{1}, y_{1}),(x_{2}, y_{2}),...,(x_{m}, y_{m}))$$

Linear regression seeks weights $$w$$ such that:

$$xw = y$$

The least-squares loss (empirical risk) is:

$$Loss = \dfrac{1}{2} \sum_{i} (y_{i}-x_{i}w)^{2} = \dfrac{1}{2} (y-xw)^{T}(y-xw)$$

Setting the gradient to zero:

$$\dfrac{\partial Loss}{\partial w} = \dfrac{1}{2}\dfrac{\partial (y-xw)^{T}(y-xw)}{\partial w} = x^{T}(xw-y) = 0$$

Solving gives the normal-equation solution:

$$w = (x^{T}x)^{-1}x^{T}y$$
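As a sanity check, the closed-form solution above can be evaluated numerically. A minimal sketch on synthetic data (the sizes, true weights, and noise level are arbitrary illustrative choices):

```python
# A minimal sketch of the closed-form least-squares solution
# w = (x^T x)^{-1} x^T y, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                    # many more samples than features
x = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true + 0.01 * rng.normal(size=m)

# Solve the normal equations (x^T x) w = x^T y rather than forming the
# inverse explicitly, which is cheaper and numerically safer.
w_hat = np.linalg.solve(x.T @ x, x.T @ y)
print(w_hat)  # close to w_true
```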

If $$x^{T}x$$ is not full rank, this inverse does not exist. Since deep learning relies on a large number of external image samples to keep learning the parameters $$w$$, the network is effectively solving a huge system of nonlinear equations. When the number of equations is smaller than the feature dimension, the hypothesis space of possible solutions grows, and under the $$Loss$$ constraint alone it is easy to learn an overfitted solution. The fundamental remedy is to make the number of samples far larger than the feature dimension, which constrains the solution space and makes a good model easier to obtain. There are several ways to achieve this:

1) Increase the number of samples (data augmentation, etc.)
2) Reduce the number of features (dropout, PCA dimensionality reduction, etc.)
3) Add a regularization term as a constraint
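The rank argument above is easy to demonstrate: with fewer samples than features, $$x^{T}x$$ cannot be full rank. A small NumPy sketch (dimensions are illustrative):

```python
# With fewer samples (m) than features (n), x^T x is an n x n matrix of
# rank at most m, hence singular and not invertible.
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 10                     # fewer samples than features
x = rng.normal(size=(m, n))
gram = x.T @ x                   # 10 x 10, but rank(gram) <= rank(x) <= 5

print(np.linalg.matrix_rank(gram))  # 5, not 10: the normal equations
                                    # have no unique solution
```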

Adding an $$L_{2}$$ penalty yields the structural risk loss:

$$Loss = \dfrac{1}{2} \sum_{i} (y_{i}-x_{i}w)^{2} + \dfrac{\lambda}{2} w^{T}w = \dfrac{1}{2} (y-xw)^{T}(y-xw) + \dfrac{\lambda}{2} w^{T}w$$

$$\dfrac{\partial Loss}{\partial w} = x^{T}(xw-y)+\lambda w = (x^{T}x+\lambda I)w - x^{T}y = 0 \implies w = (x^{T}x+\lambda I)^{-1}x^{T}y$$

Since $$x^{T}x$$ is positive semi-definite, $$x^{T}x+\lambda I$$ is invertible for any $$\lambda > 0$$, so this solution always exists.
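A minimal sketch of the regularized solution: even in the rank-deficient case $$m < n$$, where $$x^{T}x$$ is singular, adding $$\lambda I$$ makes the system solvable (data and $$\lambda$$ are arbitrary illustrative choices):

```python
# Sketch of the ridge solution w = (x^T x + lam*I)^{-1} x^T y, in the
# underdetermined case where plain least squares has no unique answer.
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 5, 10, 0.1
x = rng.normal(size=(m, n))
y = rng.normal(size=m)

w_ridge = np.linalg.solve(x.T @ x + lam * np.eye(n), x.T @ y)

# The gradient from the derivation above vanishes at this solution:
grad = x.T @ (x @ w_ridge - y) + lam * w_ridge
print(np.max(np.abs(grad)))  # ~0
```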

Now take a probabilistic view. Assume $$y = xw + \delta$$ with Gaussian noise $$\delta \sim N(0, \sigma^{2})$$, so that $$y \sim N(xw, \sigma^{2})$$. Then:

$$p(y|x,w) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp \left(-\dfrac{(y-xw)^{T}(y-xw)}{2\sigma^{2}}\right)$$

$$MLE = \log \prod_{i} p(y_{i}|x_{i}, w)=\sum_{i} \log p(y_{i}|x_{i}, w) = m \log \dfrac{1}{\sigma \sqrt{2\pi}}-\dfrac{(y-xw)^{T}(y-xw)}{2\sigma^{2}}$$

Maximizing the log-likelihood gives $$w = \arg\max(MLE) = \arg\min\,(y-xw)^{T}(y-xw)$$, which is exactly the empirical risk loss defined above.

For the MAP estimate, $$w = \arg\max(MAP) = \arg\max(p(y|x,w)p(w))$$. Assuming a zero-mean Gaussian prior $$w \sim N(0, \sigma_{w}^{2})$$, this expands to:

$$p(y|x,w)p(w) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp \left(-\dfrac{(y-xw)^{T}(y-xw)}{2\sigma^{2}}\right) \times \dfrac{1}{\sigma_{w} \sqrt{2\pi}} \exp \left(-\dfrac{w^{T}w}{2\sigma_{w}^{2}}\right)$$

$$w = \arg\max(MAP) = \arg\min\left((y-xw)^{T}(y-xw)+\dfrac{\sigma^{2}}{\sigma_{w}^{2}}w^{T}w\right)$$

Letting $$\lambda = \dfrac{\sigma^{2}}{\sigma_{w}^{2}}$$, this matches the structural risk loss defined above: $$L_{2}$$ regularization is equivalent to placing a zero-mean Gaussian prior on the weights.
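The equivalence can be checked numerically: gradient descent on the negative log-posterior should recover the ridge solution with $$\lambda = \sigma^{2}/\sigma_{w}^{2}$$. A sketch with illustrative choices for the data, $$\sigma$$, and $$\sigma_{w}$$:

```python
# Numerical check of the MAP/ridge equivalence: gradient descent on the
# negative log-posterior converges to the ridge solution with
# lam = sigma^2 / sigma_w^2.
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 4
sigma, sigma_w = 0.5, 2.0
x = rng.normal(size=(m, n))
y = x @ rng.normal(size=n) + sigma * rng.normal(size=m)

lam = sigma**2 / sigma_w**2
w_ridge = np.linalg.solve(x.T @ x + lam * np.eye(n), x.T @ y)

# Minimize (y-xw)^T (y-xw)/(2 sigma^2) + w^T w/(2 sigma_w^2) directly.
w = np.zeros(n)
lr = 1e-3
for _ in range(20000):
    grad = x.T @ (x @ w - y) / sigma**2 + w / sigma_w**2
    w -= lr * grad

print(np.max(np.abs(w - w_ridge)))  # ~0: the MAP estimate equals ridge
```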

Reference: https://www.cnblogs.com/zhaozhibo/p/14916406.html