# Hands-on Deep Learning: Batch Normalization

2022-08-06 07:07:23

Training deep neural networks is difficult, and getting them to converge in a reasonable amount of time is even trickier. This post introduces batch normalization, a popular and effective technique that consistently accelerates the convergence of deep networks.

How does batch normalization achieve accelerated convergence? The example below will make it clear.

Suppose we have a very simple linear model with no activation function: its inputs are $x_1$ and $x_2$, the corresponding parameters are $w_1$ and $w_2$, and the output is $y = w_1 x_1 + w_2 x_2$.

When we make a small change to $w_1$, the output $y$ changes with it, and the error $L$ fluctuates accordingly. Suppose the input $x_1$ is very small: then under this small change, $y$ barely moves, and minimizing $L$ in the $w_1$ direction is a little easier. But if $x_2$ is very large, the same small change to $w_2$ produces a very large change in the loss. So when the inputs have very different scales, the gradients of the loss with respect to the different parameters also differ greatly (left figure), whereas what we would like to see is the situation on the right.

(In the error-surface plots, the vertical and horizontal axes correspond to the two parameters $w_2$ and $w_1$.)

For the situation on the left, we do of course have other ways to cope: on such an error surface, a fixed learning rate struggles to give good results, so we need an adaptive learning rate, or more advanced optimization methods such as Adam, to do well.
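The scale-sensitivity argument above can be checked numerically. A minimal sketch in plain Python, using a squared loss and made-up input values for illustration:

```python
# For y = w1*x1 + w2*x2 and loss L = (y - t)**2,
# dL/dw_i = 2*(y - t)*x_i: each gradient scales with its input x_i.
def grads(w1, w2, x1, x2, t):
    y = w1 * x1 + w2 * x2
    return 2 * (y - t) * x1, 2 * (y - t) * x2  # dL/dw1, dL/dw2

# x1 is tiny, x2 is huge: the same small nudge to w1 vs. w2
# moves the loss by wildly different amounts.
g1, g2 = grads(w1=1.0, w2=1.0, x1=0.01, x2=100.0, t=0.0)
print(abs(g2) / abs(g1))  # gradient ratio equals x2 / x1 = 10000.0
```

This is exactly the elongated error surface in the left figure: steep along $w_2$, nearly flat along $w_1$.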

In a deep network, the distribution of the pre-activation inputs x gradually shifts as the network gets deeper and as training proceeds, typically drifting toward the saturating upper and lower ends of the activation function's input range. This causes the gradients of the lower layers to vanish during backpropagation, which is the essential reason deep networks converge more and more slowly.
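The saturation effect is easy to see numerically; a small sketch with the sigmoid activation (NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# Near 0 the gradient is healthy; deep in the saturating tails it vanishes,
# and after a few layers of backpropagation almost nothing is left.
print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # ~4.5e-05
```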

# Feature Normalization

The method described below is only one form of feature normalization; in practice there are many different variants.

We take a batch of data (batch size > 1) and normalize each dimension in turn (strictly speaking, this is standardization): subtract the mean, then divide by the standard deviation. Afterwards the values in every dimension are distributed around 0, which produces a better (smoother) error surface. Feature normalization of this kind usually helps training: it lets the loss converge faster under gradient descent and makes training a little smoother.
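Per-dimension standardization over a batch can be sketched as follows (NumPy, with a made-up batch whose two features have very different scales):

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])       # batch of 3 samples, 2 features

mu = X.mean(axis=0)                # per-feature mean
sigma = X.std(axis=0)              # per-feature standard deviation
X_tilde = (X - mu) / sigma         # standardize each dimension

print(X_tilde.mean(axis=0))        # ~[0. 0.]
print(X_tilde.std(axis=0))         # ~[1. 1.]
```

After this step both features live on the same scale, so no single parameter direction dominates the error surface.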

Of course, we don't normalize only the input batch: each layer's output becomes the next layer's input, so we can also apply normalization after every layer's output. As for whether to place it before or after the activation function, experiments have found that it makes little difference, so it comes down to personal preference.

1. Next, take the normalized $\hat{x}$ and multiply it elementwise by a learnable scale parameter $\gamma$;
2. Then add a learnable shift parameter $\beta$ to obtain the output $y = \gamma \odot \hat{x} + \beta$.

Each time a batch is drawn, we obtain its own statistics: $\mu^1$ and $\sigma^1$ for the first batch, $\cdots$, so the $t$-th batch has $\mu^t$ and $\sigma^t$. They are computed over the batch $B$ as $\mu = \frac{1}{|B|}\sum_{i \in B} x^{(i)}$ and $\sigma^2 = \frac{1}{|B|}\sum_{i \in B} (x^{(i)} - \mu)^2$, giving the corresponding $\hat{x}^{(i)} = \frac{x^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$.

PyTorch implementation:

```python
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # In prediction mode, use the running mean and variance directly
    if not torch.is_grad_enabled():
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance over the feature axis
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: mean and variance per channel (axis=1).
            # Keep X's shape so broadcasting works below
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch statistics
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the running mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer,
    # or number of output channels of a convolutional layer.
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters, trained via gradient descent,
        # initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Non-parameter running statistics, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is on a different device, copy moving_mean and
        # moving_var to the device (e.g. GPU memory) where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
```
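In practice one would usually reach for PyTorch's built-in layer rather than this from-scratch version. A quick sanity check with `nn.BatchNorm2d`, the built-in counterpart of `BatchNorm(num_features, num_dims=4)` above:

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(6)          # 6 channels, gamma=1 and beta=0 at init
X = torch.randn(4, 6, 8, 8)     # (N, C, H, W)
Y = bn(X)                       # training mode by default

print(Y.shape)                  # torch.Size([4, 6, 8, 8])
# Each channel of Y is normalized over (N, H, W), so it is ~zero-mean
print(Y.mean(dim=(0, 2, 3)).abs().max() < 1e-4)
```

Note that PyTorch's `momentum` convention is the complement of the one in `batch_norm` above: it weights the new batch statistic rather than the running one, so its default `momentum=0.1` matches `momentum=0.9` in the from-scratch code.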

https://chowdera.com/2022/218/202208060623513270.html