
Hands-on Deep Learning: Batch Normalization

2022-08-06 07:07:23 · CV Small Rookie

        Training deep neural networks is difficult, and getting them to converge in a reasonable amount of time is even trickier. This post introduces batch normalization, a popular and effective technique that consistently accelerates the convergence of deep networks.

        How does batch normalization accelerate convergence? The example below makes this clear.

        Suppose we have a very simple linear model whose inputs are x_{1} and x_{2}, with corresponding parameters w_{1} and w_{2}, and no activation function.

        When we apply a small change \Delta w to w, the output y changes with it, and the error L fluctuates as well. Suppose the input x_{1} is very small: then for this small change, L barely moves, and minimizing L along that direction is relatively easy. But if x_{2} is very large, the same small change \Delta w causes a large change in the loss. So the gradients of the loss differ greatly across the different inputs (left figure), whereas what we want is the situation on the right.

 In the figures, the vertical axis is w_{1} and the horizontal axis is w_{2}.
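A tiny numerical illustration of the point above (this example is not from the original post; the specific values are made up): for a linear model y_hat = w_1 x_1 + w_2 x_2 with squared-error loss L = (y_hat − y)^2, the gradient with respect to each weight scales with its input, so a small x_1 gives a flat direction and a large x_2 a steep one:

```python
# Toy example: gradient dL/dw_i = 2 * (y_hat - y) * x_i scales with x_i.
x1, x2 = 0.1, 100.0        # one small input, one large input
w1, w2, y = 1.0, 1.0, 0.0  # arbitrary weights and target
y_hat = w1 * x1 + w2 * x2
err = y_hat - y
grad_w1 = 2 * err * x1     # small gradient: L is flat along w1
grad_w2 = 2 * err * x2     # large gradient: L is steep along w2
print(grad_w1, grad_w2)
```

The two gradients differ by the ratio x_2 / x_1 = 1000, which is exactly the elongated error surface of the left figure.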

        For the situation above (left figure), there are of course other remedies. With inputs like x_{1} and x_{2}, a fixed learning rate may struggle to get good results, so we need an adaptive learning rate, or more advanced optimization methods such as Adam, to do well.

        In a deep network, the distribution of the activation inputs x (the values fed into the nonlinear transform) gradually shifts as the network gets deeper or as training proceeds. Convergence is slow chiefly because the overall distribution drifts toward the upper and lower saturation limits of the nonlinear (activation) function's range; this makes the gradients of the lower layers vanish during backpropagation, and it is the root cause of deep networks converging more and more slowly.
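An illustrative check of the saturation claim (not from the original post), using the sigmoid as the activation: its derivative s(x)(1 − s(x)) is largest at the center and nearly zero in the tails, so activations drifting toward the limits of the range produce tiny gradients:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # center of the (near-linear) region
print(sigmoid_grad(10.0))   # deep in the saturated region
```

At x = 0 the gradient is 0.25; at x = 10 it is on the order of 10^-5, and multiplying several such factors across layers is what makes lower-layer gradients vanish.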

Now let's approach the problem from another angle. Instead of changing the learning rate to adapt to the error surface, we directly modify the error surface to reduce the disparity in gradients: xx normalization (there are many things xx can stand for).

Feature Normalization

        The method described below is just one kind of feature normalization; there are in fact many different variants.

        We take each dimension of a batch of data (batch > 1) in turn and normalize it (strictly speaking, standardize it): subtract the mean and divide by the standard deviation. This way every dimension is numerically distributed around 0, which yields a better (smoother) error surface. Feature normalization like this usually helps training: it lets the loss converge a bit faster under gradient descent and makes training a little smoother.
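A minimal sketch of this per-dimension standardization over a batch (the data values are illustrative): each feature column gets its own mean subtracted and is divided by its own standard deviation, leaving every dimension distributed around 0 with unit spread:

```python
import torch

X = torch.tensor([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])          # batch of 3 samples, 2 features
mu = X.mean(dim=0)                         # per-feature mean over the batch
sigma = X.std(dim=0, unbiased=False)       # per-feature (population) std
X_norm = (X - mu) / sigma                  # standardize each dimension
print(X_norm.mean(dim=0))                  # per-feature mean is ~0
print(X_norm.std(dim=0, unbiased=False))   # per-feature std is ~1
```

Both columns end up on the same numerical scale, which is precisely what smooths the error surface in the earlier two-weight example.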

        Of course we don't normalize only the input batch. The output of each layer becomes the input of the next layer, so we can also normalize after each layer's output. As for whether to place the normalization before or after the activation function, experiments find little difference, so it comes down to personal preference.

When doing batch normalization, there is often one more design step. After computing \tilde{z}:

  1. Multiply \tilde{z} elementwise by another vector, the stretch (scale) parameter \gamma;
  2. Then add the shift parameter vector \beta, obtaining \hat{z}.
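The two steps above can be sketched in a couple of lines (the values of \gamma and \beta here are illustrative; in a network they are learned parameters):

```python
import torch

z_tilde = torch.tensor([-1.0, 0.0, 1.0])   # already-standardized activations
gamma = torch.tensor([2.0, 2.0, 2.0])      # stretch parameter (scale)
beta = torch.tensor([0.5, 0.5, 0.5])       # shift parameter
z_hat = gamma * z_tilde + beta             # elementwise product, then shift
print(z_hat)
```

With \gamma initialized to ones and \beta to zeros (as in the code later in this post), the layer starts out as a pure standardization and learns how far to move away from it.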

The parameters \gamma and \beta here are network parameters, learned during training. Why add \gamma and \beta at all?

After normalization, the mean may become 0, and a mean of 0 effectively imposes a certain constraint on the network. This constraint may have some negative effects, so we use \gamma and \beta to make further adjustments to \hat{z}. (The scale and shift nudge the values slightly away from the standard normal distribution, by a different amount for each instance. This is equivalent to moving the activation's input from the linear region around the center a bit toward the nonlinear region. The core idea is to find a good balance between linearity and nonlinearity: enjoying the stronger expressive power of the nonlinearity while avoiding the two saturated ends, which would make the network converge too slowly.)

At test time the batch may have size 1, and subtracting the mean would give 0 (the mean is the value itself). In practice PyTorch has already taken care of this: during training, the \mu and \sigma computed for each batch are used to maintain what is called a moving average.

Each time a batch comes through we get a mean: the first batch gives \mu^{1}, ..., the t-th batch gives \mu^{t}. Then, following the formula \bar{\mu}=p\bar{\mu}+(1-p)\mu^{t}, we compute \bar{\mu} and the corresponding \bar{\sigma}, which are the statistics used at test time.
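Following the formula above, a minimal sketch of the moving-average update (the initial \bar{\mu} = 0 and the batch means are illustrative; p plays the role that the PyTorch code below calls momentum):

```python
p = 0.9                              # decay factor (PyTorch's "momentum")
mu_bar = 0.0                         # running mean, initialized to 0
batch_means = [1.0, 2.0, 3.0]        # mu^1, mu^2, mu^3 from three batches
for mu_t in batch_means:
    # mu_bar = p * mu_bar + (1 - p) * mu^t
    mu_bar = p * mu_bar + (1 - p) * mu_t
print(mu_bar)
```

The same update is applied to the variance, producing the \bar{\sigma} that replaces per-batch statistics at prediction time.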

PyTorch implementation:

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to tell training mode from prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, use the moving-average mean and variance directly
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance over the batch, per feature
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: mean and variance per channel (dim 1).
            # Keep X's shape here so broadcasting works below.
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch's mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving-average mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer,
    #               or number of output channels of a convolutional layer.
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters involved in gradient computation,
        # initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Non-parameter variables, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not in main memory, copy moving_mean and moving_var
        # to the device (GPU memory) where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y

Of course, PyTorch already provides the corresponding layers: just call nn.BatchNorm1d(x) / nn.BatchNorm2d(x), where x is the number of input channels; 1d and 2d correspond to the fully connected layer and the convolutional layer respectively.
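A minimal usage sketch of the built-in layers (shapes and values here are illustrative): in training mode nn.BatchNorm1d standardizes each feature over the batch, and nn.BatchNorm2d does the same per channel for a convolutional feature map:

```python
import torch
from torch import nn

bn1d = nn.BatchNorm1d(4)               # 4 = number of features
X = torch.randn(8, 4) * 5 + 3          # batch of 8, deliberately unnormalized
bn1d.train()                           # training mode: uses batch statistics
Y = bn1d(X)
print(Y.mean(dim=0))                   # per-feature mean is ~0

bn2d = nn.BatchNorm2d(3)               # 3 = number of channels
Z = bn2d(torch.randn(8, 3, 5, 5))      # (batch, channels, height, width)
print(Z.shape)                         # shape is preserved
```

Calling .eval() on either layer switches it to the moving-average statistics, exactly as the is_grad_enabled branch does in the hand-written batch_norm above.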


Copyright notice
This article was written by [CV Small Rookie]; please include the original link when reposting. Thanks.
https://chowdera.com/2022/218/202208060623513270.html
