## ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

## paper

### Abstract

The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18× faster, requires 75× less FLOPs, has 79× less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

#### point

1. Real-time segmentation has a wide range of applications;

2. The paper proposes ENet (efficient neural network);

3. ENet is fast while still achieving good accuracy;

4. The model is validated on the CamVid, Cityscapes, and SUN RGB-D datasets.

### 1 Introduction

Recent interest in augmented reality wearables, home-automation devices, and self-driving vehicles has created a strong need for semantic-segmentation (or visual scene-understanding) algorithms that can operate in real-time on low-power mobile devices. These algorithms label each and every pixel in the image with one of the object classes. In recent years, the availability of larger datasets and computationally-powerful machines have helped deep convolutional neural networks (CNNs) [1, 2, 3, 4] surpass the performance of many conventional computer vision algorithms [5, 6, 7]. Even though CNNs are increasingly successful at classification and categorization tasks, they provide coarse spatial results when applied to pixel-wise labeling of images. Therefore, they are often cascaded with other algorithms to refine the results, such as color based segmentation [8] or conditional random fields [9], to name a few.

In order to both spatially classify and finely segment images, several neural network architectures have been proposed, such as SegNet [10, 11] or fully convolutional networks [12]. All these works are based on a VGG16 [13] architecture, which is a very large model designed for multi-class classification. These references propose networks with huge numbers of parameters, and long inference times. In these conditions, they become unusable for many mobile or battery-powered applications, which require processing images at rates higher than 10 fps.

In this paper, we propose a new neural network architecture optimized for fast inference and high accuracy. Examples of images segmented using ENet are shown in Figure 1. In our work, we chose not to use any post-processing steps, which can of course be combined with our method, but would worsen the performance of an end-to-end CNN approach.

In Section 3 we propose a fast and compact encoder-decoder architecture named ENet. It has been designed according to rules and ideas that have appeared in the literature recently, all of which we discuss in Section 4. The proposed network has been evaluated on Cityscapes [14] and CamVid [15] for the driving scenario, whereas the SUN dataset [16] has been used for testing our network in an indoor situation. We benchmark it on the NVIDIA Jetson TX1 Embedded Systems Module as well as on an NVIDIA Titan X GPU. The results can be found in Section 5.

#### point

1. Demand for real-time segmentation is large.

2. This paper proposes a lightweight network that can be used for real-time segmentation with good accuracy.

3. Section layout and experimental setup.

#### Conditional Random Fields (CRF)

A CRF is an undirected graphical model that combines characteristics of the maximum entropy model and the hidden Markov model; it is commonly used for labeling or analyzing sequence data.

##### Maximum entropy model

###### Lagrange multipliers

###### Bayes' theorem

##### Entropy

Entropy is a measure of the uncertainty of a random variable.

H(X) depends only on the distribution of X, not on the particular values X takes, so H(X) is often written as H(p). The larger H(X) is, the greater the uncertainty of X.
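As a quick illustration (a sketch added for these notes, not from the paper), the Shannon entropy H(p) = −Σ p(x) log p(x) can be computed directly, and the uniform distribution indeed maximizes it:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum(p_i * log(p_i)), in nats; 0*log(0) = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty over 4 outcomes
skewed = [0.7, 0.1, 0.1, 0.1]       # less uncertain
certain = [1.0, 0.0, 0.0, 0.0]      # no uncertainty at all

print(entropy(uniform))  # log(4) ~ 1.386, the maximum for 4 outcomes
print(entropy(skewed))   # smaller than the uniform case
print(entropy(certain))  # 0.0
```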

##### Conditional entropy

##### Maximum entropy model (MaxEnt)

Core idea: the maximum entropy principle states that when predicting the probability distribution of a random event, the prediction should satisfy all known constraints while making no subjective assumptions about anything unknown. Under these conditions the probability distribution is as uniform as possible, the prediction risk is small, and the resulting distribution is the one with maximum entropy.

###### Definition of MaxEnt

Empirical distributions: distributions obtained by counting over the training data T. Two empirical distributions are needed: the joint empirical distribution of (x, y) and the marginal empirical distribution of x.

Constraints
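Written out, the standard formulation (stated here for completeness; the note above leaves it implicit) is: the model expectation of each feature function must match its empirical expectation, and among all conforming distributions MaxEnt picks the one with the largest conditional entropy:

```latex
% Empirical distributions counted from the training data T (N samples):
\tilde{P}(x,y) = \frac{\mathrm{count}(x,y)}{N}, \qquad
\tilde{P}(x) = \frac{\mathrm{count}(x)}{N}

% Constraint, for every feature function f_i:
\sum_{x,y} \tilde{P}(x)\, P(y \mid x)\, f_i(x,y)
  = \sum_{x,y} \tilde{P}(x,y)\, f_i(x,y)

% Objective, solved with Lagrange multipliers:
\max_{P} \; H(P) = -\sum_{x,y} \tilde{P}(x)\, P(y \mid x) \log P(y \mid x)
```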

##### Hidden Markov Model (HMM)

An HMM is a probabilistic model of time series; a hidden Markov chain randomly generates an unobservable sequence of states.

The Markov assumption itself is simple: the state of a stochastic process at the next moment depends only on its state at the current moment.

In an HMM, the Markov chain generates the random, unobservable state sequence, and each state then generates an observation, yielding the observable random sequence.
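This two-step generative process can be sketched in a few lines of pure Python (an illustrative toy model invented for these notes, not from the paper): the hidden chain evolves by the Markov assumption, and each state emits one observation:

```python
import random

# Hypothetical two-state weather HMM: states are hidden, emissions are seen.
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},   # P(next state | state)
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # P(obs | state)
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # float round-off fallback

def generate(length, start="Sunny", seed=0):
    rng = random.Random(seed)
    state, hidden, observed = start, [], []
    for _ in range(length):
        hidden.append(state)                       # unobservable state sequence
        observed.append(sample(emit[state], rng))  # observable emission
        state = sample(trans[state], rng)          # Markov step: depends only on current state
    return hidden, observed

hidden, observed = generate(5)
print(hidden)
print(observed)
```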

### 2 Related work

Semantic segmentation is important in understanding the content of images and finding target objects. This technique is of utmost importance in applications such as driving aids and augmented reality. Moreover, real-time operation is a must for them, and therefore, designing CNNs carefully is vital. Contemporary computer vision applications extensively use deep neural networks, which are now one of the most widely used techniques for many different tasks, including semantic segmentation. This work presents a new neural network architecture, and therefore we aim to compare to other literature that performs the large majority of inference in the same way.

State-of-the-art scene-parsing CNNs use two separate neural network architectures combined together: an encoder and a decoder. Inspired by probabilistic auto-encoders [17, 18], encoder-decoder network architecture has been introduced in SegNet-basic [10], and further improved in SegNet [11]. The encoder is a vanilla CNN (such as VGG16 [13]) which is trained to classify the input, while the decoder is used to upsample the output of the encoder [12, 19, 20, 21, 22]. However, these networks are slow during inference due to their large architectures and numerous parameters. Unlike in fully convolutional networks (FCN) [12], fully connected layers of VGG16 were discarded in the latest incarnation of SegNet, in order to reduce the number of floating point operations and memory footprint, making it the smallest of these networks. Still, none of them can operate in real-time.

Other existing architectures use simpler classifiers and then cascade them with Conditional Random Field (CRF) as a post-processing step [9, 23]. As shown in [11], these techniques use onerous post-processing steps and often fail to label the classes that occupy fewer pixels in a frame.

CNNs can be also combined with recurrent neural networks [20] to improve accuracy, but then they suffer from speed degradation. Also, one has to keep in mind that RNN, used as a post-processing step, can be used in conjunction with any other technique, including the one presented in this work.

#### point

1. The key to semantic segmentation is framed as understanding image content and finding target objects; real-time operation is also a major research focus.

2. Encoder-decoder structures are widely used.

3. Post-processing such as CRFs is often appended after a CNN to refine the results.

### 3 Network architecture (key section)

The architecture of our network is presented in Table 1. It is divided into several stages, as highlighted by horizontal lines in the table and the first digit after each block name. Output sizes are reported for an example input image resolution of 512 × 512. We adopt a view of ResNets [24] that describes them as having a single main branch and extensions with convolutional filters that separate from it, and then merge back with an element-wise addition, as shown in Figure 2b. Each block consists of three convolutional layers: a 1 × 1 projection that reduces the dimensionality, a main convolutional layer (conv in Figure 2b), and a 1 × 1 expansion. We place Batch Normalization [25] and PReLU [26] between all convolutions. Just as in the original paper, we refer to these as bottleneck modules. If the bottleneck is downsampling, a max pooling layer is added to the main branch.

Also, the first 1 × 1 projection is replaced with a 2 × 2 convolution with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps. conv is either a regular, dilated or full convolution (also known as deconvolution or fractionally strided convolution) with 3 × 3 filters. Sometimes we replace it with asymmetric convolution i.e. a sequence of 5 × 1 and 1 × 5 convolutions. For the regularizer, we use Spatial Dropout [27], with p = 0.01 before bottleneck2.0, and p = 0.1 afterwards.

The initial stage contains a single block, which is presented in Figure 2a. Stage 1 consists of 5 bottleneck blocks, while stages 2 and 3 have the same structure, with the exception that stage 3 does not downsample the input at the beginning (we omit the 0th bottleneck). The first three stages form the encoder; stages 4 and 5 belong to the decoder.

We did not use bias terms in any of the projections, in order to reduce the number of kernel calls and overall memory operations, as cuDNN [29] uses separate kernels for convolution and bias addition. This choice didn’t have any impact on the accuracy. Between each convolutional layer and following non-linearity we use Batch Normalization [25]. In the decoder max pooling is replaced with max unpooling, and padding is replaced with spatial convolution without bias. We did not use pooling indices in the last upsampling module, because the initial block operated on the 3 channels of the input frame, while the final output has C feature maps (the number of object classes). Also, for performance reasons, we decided to place only a bare full convolution as the last module of the network, which alone takes up a sizeable portion of the decoder processing time.

#### point

1. The network contains two kinds of modules, the initial block and the bottleneck, and uses PReLU as the activation function.

2. The initial block runs a stride-2 conv and max-pooling in parallel, and is used for downsampling.

3. The bottleneck is a residual structure modeled on ResNet; the paper derives five variants, and every convolution layer appears in the conv-BN-PReLU form.

4. Variant 1: regular, with no max-pooling or padding; the conv has kernel_size=3.

5. Variant 2: downsampling, with max-pooling and padding; the 1x1 conv is replaced by a conv with kernel_size=2 and stride=2.

6. Variant 3: dilated, setting a dilation rate on the conv.

7. Variant 4: asymmetric, replacing the conv with (1,5) and (5,1) asymmetric convolutions.

8. Variant 5: upsampling, replacing max-pooling with max-unpooling.

### 4 Design choices

In this section we will discuss our most important experimental results and intuitions, that have shaped the final architecture of ENet.

**Feature map resolution** Downsampling images during semantic segmentation has two main drawbacks. Firstly, reducing feature map resolution implies loss of spatial information like exact edge shape. Secondly, full pixel segmentation requires that the output has the same resolution as the input. This implies that strong downsampling will require equally strong upsampling, which increases model size and computational cost. The first issue has been addressed in FCN [12] by adding the feature maps produced by encoder, and in SegNet [10] by saving indices of elements chosen in max pooling layers, and using them to produce sparse upsampled maps in the decoder. We followed the SegNet approach, because it allows us to reduce memory requirements. Still, we have found that strong downsampling hurts the accuracy, and tried to limit it as much as possible.

However, downsampling has one big advantage. Filters operating on downsampled images have a bigger receptive field, that allows them to gather more context. This is especially important when trying to differentiate between classes like, for example, rider and pedestrian in a road scene. It is not enough that the network learns how people look, the context in which they appear is equally important. In the end, we have found that it is better to use dilated convolutions for this purpose [30].

**Early downsampling** One crucial intuition to achieving good performance and real-time operation is realizing that processing large input frames is very expensive. This might sound very obvious, however many popular architectures do not pay much attention to the optimization of the early stages of the network, which are often the most expensive by far.

ENet's first two blocks heavily reduce the input size, and use only a small set of feature maps. The idea behind this is that visual information is highly spatially redundant, and thus can be compressed into a more efficient representation. Also, our intuition is that the initial network layers should not directly contribute to classification. Instead, they should rather act as good feature extractors and only preprocess the input for later portions of the network. This insight worked well in our experiments; increasing the number of feature maps from 16 to 32 did not improve accuracy on the Cityscapes [14] dataset.

**Decoder size** In this work we would like to provide a different view on encoder-decoder architectures than the one presented in [11]. SegNet is a very symmetric architecture, as the decoder is an exact mirror of the encoder. Instead, our architecture consists of a large encoder, and a small decoder.

This is motivated by the idea that the encoder should be able to work in a similar fashion to original classification architectures, i.e. to operate on smaller resolution data and provide for information processing and filtering. Instead, the role of the decoder is to upsample the output of the encoder, only fine-tuning the details.

**Nonlinear operations** Initial layers' weights exhibit a large variance and are slightly biased towards positive values, while in the later portions of the encoder they settle to a recurring pattern. All layers in the main branch behave nearly exactly like regular ReLUs, while the weights inside bottleneck modules are negative i.e. the function inverts and scales down negative values. We hypothesize that identity did not work well in our architecture because of its limited depth. The reason why such lossy functions are learned might be that the original ResNets [31] are networks that can be hundreds of layers deep, while our network uses only a couple of layers, and it needs to quickly filter out information. It is notable that the decoder weights become much more positive and learn functions closer to identity. This confirms our intuition that the decoder is used only to fine-tune the upsampled output.

**Information-preserving dimensionality changes** As stated earlier, it is necessary to downsample the input early, but aggressive dimensionality reduction can also hinder the information flow. A very good approach to this problem has been presented in [28]. It has been argued that the method used by the VGG architectures, i.e. performing a pooling followed by a convolution expanding the dimensionality, while relatively cheap, introduces a representational bottleneck (or forces one to use a greater number of filters, which lowers computational efficiency). On the other hand, pooling after a convolution that increases feature map depth is computationally expensive. Therefore, as proposed in [28], we chose to perform the pooling operation in parallel with a convolution of stride 2, and concatenate the resulting feature maps. This technique allowed us to speed up inference time of the initial block 10 times.

Additionally, we have found one problem in the original ResNet architecture. When downsampling, the first 1×1 projection of the convolutional branch is performed with a stride of 2 in both dimensions, which effectively discards 75% of the input. Increasing the filter size to 2 × 2 allows to take the full input into consideration, and thus improves the information flow and accuracy. Of course, it makes these layers 4× more computationally expensive, however there are so few of these in ENet, that the overhead is unnoticeable.

**Factorizing filters** It has been shown that convolutional weights have a fair amount of redundancy, and each n × n convolution can be decomposed into two smaller ones following each other: one with a n × 1 filter and the other with a 1 × n filter [32]. This idea has been also presented in [28], and from now on we adopt their naming convention and will refer to these as asymmetric convolutions.

We have used asymmetric convolutions with n = 5 in our network, so cost of these two operations is similar to a single 3 × 3 convolution. This allowed to increase the variety of functions learned by blocks and increase the receptive field.
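The parameter arithmetic behind this is easy to check: an n × n kernel over C input and C output channels costs n²C² weights, while the n × 1 plus 1 × n pair costs 2nC². A quick sketch with illustrative channel counts (not figures from the paper):

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a conv layer (bias ignored, as ENet drops biases)."""
    return k_h * k_w * c_in * c_out

c = 64  # example channel count

full = conv_params(5, 5, c, c)                                # single 5x5 conv
factored = conv_params(5, 1, c, c) + conv_params(1, 5, c, c)  # 5x1 then 1x5

print(full, factored)           # 102400 vs 40960
print(factored / full)          # 0.4 -> 2.5x fewer parameters
# The factored 5x5 also costs about the same as one 3x3, as the text notes:
print(conv_params(3, 3, c, c))  # 36864, close to the factored 40960
```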

What’s more, a sequence of operations used in the bottleneck module (projection, convolution, projection) can be seen as decomposing one large convolutional layer into a series of smaller and simpler operations, that are its low-rank approximation. Such factorization allows for large speedups, and greatly reduces the number of parameters, making them less redundant [32]. Additionally, it allows to make the functions they compute richer, thanks to the non-linear operations that are inserted between layers.

**Dilated convolutions** As argued above, it is very important for the network to have a wide receptive field, so it can perform classification by taking a wider context into account. We wanted to avoid overly downsampling the feature maps, and decided to use dilated convolutions [30] to improve our model. They replaced the main convolutional layers inside several bottleneck modules in the stages that operate on the smallest resolutions. These gave a significant accuracy boost, raising IoU on Cityscapes by around 4 percentage points, with no additional cost. We obtained the best accuracy when we interleaved them with other bottleneck modules (both regular and asymmetric), instead of arranging them in sequence, as has been done in [30].
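The receptive-field gain is free in parameter terms: a k × k kernel with dilation rate d covers an effective extent of k + (k − 1)(d − 1) while keeping only k² weights. A small sketch (the rates 2, 4, 8, 16 follow the dilated bottlenecks of Table 1):

```python
def effective_kernel(k, d):
    """Spatial extent covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# ENet's dilated bottlenecks use 3x3 kernels with growing dilation rates:
for d in (1, 2, 4, 8, 16):
    print("dilation", d, "-> effective extent", effective_kernel(3, d))
# At dilation 16 a 3x3 kernel spans a 33x33 region with only 9 weights
# per input-output channel pair.
```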

**Regularization** Most pixel-wise segmentation datasets are relatively small (on the order of 10³ images), so expressive models such as neural networks quickly begin to overfit them. In initial experiments, we used L2 weight decay with little success. Then, inspired by [33], we tried stochastic depth, which increased accuracy. However it became apparent that dropping whole branches (i.e. setting their output to 0) is in fact a special case of applying Spatial Dropout [27], where either all of the channels or none of them are ignored, instead of selecting a random subset. We placed Spatial Dropout at the end of the convolutional branches, right before the addition, and it turned out to work much better than stochastic depth.
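The relationship can be sketched in plain Python (toy code written for these notes, with the inference-time rescaling omitted): Spatial Dropout zeroes a random subset of whole channels, and branch-dropping as in stochastic depth is the degenerate case where that subset is all channels or none:

```python
import random

def spatial_dropout(feature_maps, p, rng):
    """Zero each channel independently with probability p."""
    return [[0.0] * len(ch) if rng.random() < p else ch
            for ch in feature_maps]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 channels, 2 values each

rng = random.Random(0)
print(spatial_dropout(x, 0.0, rng))  # p=0: nothing dropped
print(spatial_dropout(x, 1.0, rng))  # p=1: every channel zeroed -> branch drop
```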

#### point

1. Downsampling loses feature detail, and overly crude upsampling gives unsatisfactory results (a criticism of FCN). ENet therefore adopts SegNet's solution: the indices of the elements chosen by max-pooling are saved and used in the decoder to produce sparse upsampled maps.

2. The encoder uses dilated convolutions.

3. Large inputs are costly; ENet drastically reduces the data redundancy in the first two blocks of the encoder.

4. The encoder-decoder structure is chosen as a large encoder combined with a small decoder, on the view that feature extraction is chiefly the encoder's job.

5. PReLU is used as the activation function (conv-BN-PReLU).

6. Because the datasets themselves are small, overfitting sets in quickly; L2 weight decay worked poorly, and Spatial Dropout was finally chosen as it works somewhat better.

7. Splitting an n×n convolution kernel into n×1 and 1×n effectively reduces the parameter count.

8. In the initial block, the pooling and convolution operations run in parallel and are then concatenated, which speeds up the initial block's inference 10×. When downsampling, ENet uses a 2×2 convolution kernel, which effectively improves information flow and accuracy.

## code

### InitialBlock

The definition of InitialBlock (a) is shown below. Two branches are defined: the main branch is a stride-2 3x3 conv, and the extension branch is a stride-2 MaxPool2d. The main branch contributes 13 feature channels and the extension branch contributes 3; they are then concatenated along the channel dimension, so the image size is halved and the channel count becomes 16.

```python
import torch
import torch.nn as nn


class InitialBlock(nn.Module):
    """The initial block is composed of two branches:
    1. a main branch which performs a regular convolution with stride 2;
    2. an extension branch which performs max-pooling.

    Doing both operations in parallel and concatenating their results
    allows for efficient downsampling and expansion. The main branch
    outputs 13 feature maps while the extension branch outputs 3, for a
    total of 16 feature maps after concatenation.

    Keyword arguments:
    - in_channels (int): the number of input channels.
    - out_channels (int): the number of output channels.
    - bias (bool, optional): Adds a learnable bias to the output if
      ``True``. Default: False.
    - relu (bool, optional): When ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    def __init__(self, in_channels, out_channels, bias=False, relu=True):
        super().__init__()

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - the number of output channels for this branch is the
        # total minus 3, since the remaining channels come from the
        # extension branch
        self.main_branch = nn.Conv2d(
            in_channels,
            out_channels - 3,
            kernel_size=3,
            stride=2,
            padding=1,
            bias=bias)

        # Extension branch
        self.ext_branch = nn.MaxPool2d(3, stride=2, padding=1)

        # Batch normalization to be used after concatenation
        self.batch_norm = nn.BatchNorm2d(out_channels)

        # PReLU/ReLU layer to apply after concatenating the branches
        self.out_activation = activation()

    def forward(self, x):
        main = self.main_branch(x)
        ext = self.ext_branch(x)

        # Concatenate branches along the channel dimension
        out = torch.cat((main, ext), 1)

        # Apply batch normalization
        out = self.batch_norm(out)

        return self.out_activation(out)
```
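As a sanity check of the shapes (computed by hand here rather than by running the module), both branches use kernel 3, stride 2, padding 1, so the standard output-size formula ⌊(H + 2p − k)/s⌋ + 1 halves a 512×512 input, and the channel counts add up to 16:

```python
def out_size(h, k, s, p):
    """Output spatial size of a convolution or pooling layer."""
    return (h + 2 * p - k) // s + 1

h = 512
main_h = out_size(h, k=3, s=2, p=1)  # stride-2 3x3 conv branch
ext_h = out_size(h, k=3, s=2, p=1)   # stride-2 3x3 max-pool branch

main_ch, ext_ch = 16 - 3, 3          # 13 + 3 channels after concatenation

print(main_h, ext_h)                  # 256 256: both branches halve the input
print(main_ch + ext_ch)               # 16 channels total
```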

### Bottleneck (regular)

The definition of Bottleneck (b) is shown below (the main path contains no operations). The module is split into two branches: the main branch is a bare shortcut, while the extension branch contains a 1x1 conv that controls the channel change, a regular, dilated, or asymmetric conv, and a spatial dropout layer.

[Parameters]

- channels: number of input and output channels
- internal_ratio: ratio of channels before and after the projection
- kernel_size: kernel size
- padding: padding
- dilation: dilation rate
- asymmetric: whether to replace the conv with an asymmetric convolution
- dropout_prob: dropout probability
- bias: bias
- relu: ReLU if true, PReLU if false

```python
class RegularBottleneck(nn.Module):
    """Regular bottlenecks are the main building block of ENet.

    Main branch:
    1. Shortcut connection.

    Extension branch:
    1. 1x1 convolution which decreases the number of channels by
       ``internal_ratio``, also called a projection;
    2. regular, dilated or asymmetric convolution;
    3. 1x1 convolution which increases the number of channels back to
       ``channels``, also called an expansion;
    4. dropout as a regularizer.

    Keyword arguments:
    - channels (int): the number of input and output channels.
    - internal_ratio (int, optional): a scale factor applied to ``channels``
      used to compute the number of channels after the projection. E.g. given
      ``channels`` equal to 128 and internal_ratio equal to 2, the number of
      channels after the projection is 64. Default: 4.
    - kernel_size (int, optional): the kernel size of the filters used in the
      convolution layer described above in item 2 of the extension branch.
      Default: 3.
    - padding (int, optional): zero-padding added to both sides of the input.
      Default: 0.
    - dilation (int, optional): spacing between kernel elements for the
      convolution described in item 2 of the extension branch. Default: 1.
    - asymmetric (bool, optional): flags if the convolution described in item
      2 of the extension branch is asymmetric or not. Default: False.
    - dropout_prob (float, optional): probability of an element to be zeroed.
      Default: 0 (no dropout).
    - bias (bool, optional): Adds a learnable bias to the output if ``True``.
      Default: False.
    - relu (bool, optional): When ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    def __init__(self,
                 channels,
                 internal_ratio=4,
                 kernel_size=3,
                 padding=0,
                 dilation=1,
                 asymmetric=False,
                 dropout_prob=0,
                 bias=False,
                 relu=True):
        super().__init__()

        # Check that the internal_ratio parameter is within the expected
        # range [1, channels]
        if internal_ratio <= 1 or internal_ratio > channels:
            raise RuntimeError("Value out of range. Expected value in the "
                               "interval [1, {0}], got internal_scale={1}."
                               .format(channels, internal_ratio))

        internal_channels = channels // internal_ratio

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - shortcut connection

        # Extension branch - 1x1 convolution, followed by a regular, dilated
        # or asymmetric convolution, followed by another 1x1 convolution,
        # and, finally, a regularizer (spatial dropout). Number of channels
        # is constant.

        # 1x1 projection convolution
        self.ext_conv1 = nn.Sequential(
            nn.Conv2d(
                channels,
                internal_channels,
                kernel_size=1,
                stride=1,
                bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # If the convolution is asymmetric we split the main convolution in
        # two. E.g. for a 5x5 asymmetric convolution we have two convolutions:
        # the first is 5x1 and the second is 1x5.
        if asymmetric:
            self.ext_conv2 = nn.Sequential(
                nn.Conv2d(
                    internal_channels,
                    internal_channels,
                    kernel_size=(kernel_size, 1),
                    stride=1,
                    padding=(padding, 0),
                    dilation=dilation,
                    bias=bias), nn.BatchNorm2d(internal_channels),
                activation(),
                nn.Conv2d(
                    internal_channels,
                    internal_channels,
                    kernel_size=(1, kernel_size),
                    stride=1,
                    padding=(0, padding),
                    dilation=dilation,
                    bias=bias), nn.BatchNorm2d(internal_channels),
                activation())
        else:
            self.ext_conv2 = nn.Sequential(
                nn.Conv2d(
                    internal_channels,
                    internal_channels,
                    kernel_size=kernel_size,
                    stride=1,
                    padding=padding,
                    dilation=dilation,
                    bias=bias), nn.BatchNorm2d(internal_channels),
                activation())

        # 1x1 expansion convolution
        self.ext_conv3 = nn.Sequential(
            nn.Conv2d(
                internal_channels,
                channels,
                kernel_size=1,
                stride=1,
                bias=bias), nn.BatchNorm2d(channels), activation())

        self.ext_regul = nn.Dropout2d(p=dropout_prob)

        # PReLU/ReLU layer to apply after adding the branches
        self.out_activation = activation()

    def forward(self, x):
        # Main branch shortcut
        main = x

        # Extension branch
        ext = self.ext_conv1(x)    # channel reduction (projection)
        ext = self.ext_conv2(ext)  # asymmetric conv if asymmetric=True, else regular/dilated
        ext = self.ext_conv3(ext)  # channel expansion
        ext = self.ext_regul(ext)  # spatial dropout

        # Add main and extension branches
        out = main + ext

        return self.out_activation(out)
```

### Bottleneck (downsampling)

The definition of Bottleneck (b) is shown below (the main path contains a max-pool). This module reduces the feature map size. It consists of two branches: the main branch is a stride-2 max-pooling whose indices are stored for later decoding, while the extension branch contains a stride-2 2x2 conv that changes the channel count, a regular 3x3 conv, a 1x1 conv that changes the channel count, and a spatial dropout layer.

[Parameters]

- in_channels: number of input channels
- out_channels: number of output channels
- internal_ratio: as above
- return_indices: whether to return the max-pooling indices
- dropout_prob, bias, relu: as above

```python
class DownsamplingBottleneck(nn.Module):
    """Downsampling bottlenecks further downsample the feature map size.

    Main branch:
    1. max pooling with stride 2; indices are saved to be used for
       unpooling later.

    Extension branch:
    1. 2x2 convolution with stride 2 that decreases the number of channels
       by ``internal_ratio``, also called a projection;
    2. regular convolution (by default, 3x3);
    3. 1x1 convolution which increases the number of channels to
       ``out_channels``, also called an expansion;
    4. dropout as a regularizer.

    Keyword arguments:
    - in_channels (int): the number of input channels.
    - out_channels (int): the number of output channels.
    - internal_ratio (int, optional): a scale factor applied to
      ``in_channels`` used to compute the number of channels after the
      projection. E.g. given ``in_channels`` equal to 128 and internal_ratio
      equal to 2, the number of channels after the projection is 64.
      Default: 4.
    - return_indices (bool, optional): if ``True``, will return the max
      indices along with the outputs. Useful when unpooling later.
    - dropout_prob (float, optional): probability of an element to be zeroed.
      Default: 0 (no dropout).
    - bias (bool, optional): Adds a learnable bias to the output if ``True``.
      Default: False.
    - relu (bool, optional): When ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    def __init__(self,
                 in_channels,
                 out_channels,
                 internal_ratio=4,
                 return_indices=False,
                 dropout_prob=0,
                 bias=False,
                 relu=True):
        super().__init__()

        # Store parameters that are needed later
        self.return_indices = return_indices

        # Check that the internal_ratio parameter is within the expected
        # range [1, in_channels]
        if internal_ratio <= 1 or internal_ratio > in_channels:
            raise RuntimeError("Value out of range. Expected value in the "
                               "interval [1, {0}], got internal_scale={1}. "
                               .format(in_channels, internal_ratio))

        internal_channels = in_channels // internal_ratio

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - max pooling followed by feature map (channels) padding
        self.main_max1 = nn.MaxPool2d(
            2,
            stride=2,
            return_indices=return_indices)

        # Extension branch - 2x2 convolution with stride 2, followed by a
        # regular convolution, followed by a 1x1 expansion convolution. The
        # number of channels is increased to out_channels.

        # 2x2 projection convolution with stride 2
        self.ext_conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels,
                internal_channels,
                kernel_size=2,
                stride=2,
                bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # Convolution
        self.ext_conv2 = nn.Sequential(
            nn.Conv2d(
                internal_channels,
                internal_channels,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # 1x1 expansion convolution
        self.ext_conv3 = nn.Sequential(
            nn.Conv2d(
                internal_channels,
                out_channels,
                kernel_size=1,
                stride=1,
                bias=bias), nn.BatchNorm2d(out_channels), activation())

        self.ext_regul = nn.Dropout2d(p=dropout_prob)

        # PReLU/ReLU layer to apply after adding the branches
        self.out_activation = activation()

    def forward(self, x):
        # Main branch shortcut
        if self.return_indices:
            main, max_indices = self.main_max1(x)
        else:
            main = self.main_max1(x)
            max_indices = None  # avoid an unbound variable in the return below

        # Extension branch
        ext = self.ext_conv1(x)
        ext = self.ext_conv2(ext)
        ext = self.ext_conv3(ext)
        ext = self.ext_regul(ext)

        # Main branch channel padding with zeros, so the channel counts match
        n, ch_ext, h, w = ext.size()
        ch_main = main.size()[1]
        padding = torch.zeros(n, ch_ext - ch_main, h, w)

        # Before concatenating, check if main is on the CPU or GPU and
        # convert padding accordingly
        if main.is_cuda:
            padding = padding.cuda()

        # Concatenate
        main = torch.cat((main, padding), 1)

        # Add main and extension branches
        out = main + ext

        return self.out_activation(out), max_indices
```
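A quick by-hand check of what this module does to the stage-1 input of Table 1's 512×512 example (16 channels at 256×256), sketched without instantiating the module:

```python
def out_size(h, k, s, p=0):
    """Output spatial size of a convolution or pooling layer."""
    return (h + 2 * p - k) // s + 1

h, in_ch, out_ch, internal_ratio = 256, 16, 64, 4

# Main branch: 2x2 max-pool with stride 2, then zero-padding of channels
main_h = out_size(h, k=2, s=2)         # 128
pad_ch = out_ch - in_ch                # 48 zero channels appended to reach 64

# Extension branch: 2x2 stride-2 projection -> 3x3 conv -> 1x1 expansion
internal_ch = in_ch // internal_ratio  # 4 channels inside the bottleneck
ext_h = out_size(h, k=2, s=2)          # 128, matches the main branch

print(main_h, ext_h, pad_ch, internal_ch)
```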

### Bottleneck (upsampling)

The definition of Bottleneck (b) is shown below (the main path contains a 1x1 conv and max-unpooling). This module performs the upsampling. The main branch contains a 1x1 conv and max-unpooling, where the unpooling reuses the max-pooling indices from the corresponding downsampling bottleneck; the extension branch contains a 1x1 conv that changes the channel count, a 3x3 transposed convolution, and a spatial dropout layer.

```
class UpsamplingBottleneck(nn.Module):
    """The upsampling bottlenecks upsample the feature map resolution using max
    pooling indices stored from the corresponding downsampling bottleneck.

    Main branch:
    1. 1x1 convolution with stride 1 that decreases the number of channels by
       ``internal_ratio``, also called a projection;
    2. max unpool layer using the max pool indices from the corresponding
       downsampling max pool layer.

    Extension branch:
    1. 1x1 convolution with stride 1 that decreases the number of channels by
       ``internal_ratio``, also called a projection;
    2. transposed convolution (by default, 3x3);
    3. 1x1 convolution which increases the number of channels to
       ``out_channels``, also called an expansion;
    4. dropout as a regularizer.

    Keyword arguments:
    - in_channels (int): the number of input channels.
    - out_channels (int): the number of output channels.
    - internal_ratio (int, optional): a scale factor applied to ``in_channels``
      used to compute the number of channels after the projection. e.g. given
      ``in_channels`` equal to 128 and ``internal_ratio`` equal to 2 the number
      of channels after the projection is 64. Default: 4.
    - dropout_prob (float, optional): probability of an element to be zeroed.
      Default: 0 (no dropout).
    - bias (bool, optional): Adds a learnable bias to the output if ``True``.
      Default: False.
    - relu (bool, optional): When ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    def __init__(self,
                 in_channels,
                 out_channels,
                 internal_ratio=4,
                 dropout_prob=0,
                 bias=False,
                 relu=True):
        super().__init__()

        # Check if the internal_ratio parameter is within the expected range
        # [1, channels]
        if internal_ratio <= 1 or internal_ratio > in_channels:
            raise RuntimeError("Value out of range. Expected value in the "
                               "interval [1, {0}], got internal_scale={1}."
                               .format(in_channels, internal_ratio))

        internal_channels = in_channels // internal_ratio

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - 1x1 projection convolution followed by max unpooling
        self.main_conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias),
            nn.BatchNorm2d(out_channels))

        # Remember that the stride is the same as the kernel_size, just like
        # the max pooling layers
        self.main_unpool1 = nn.MaxUnpool2d(kernel_size=2)

        # Extension branch - 1x1 projection convolution, followed by a
        # transposed convolution, followed by a 1x1 expansion convolution.
        # 1x1 projection convolution with stride 1
        self.ext_conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels, internal_channels, kernel_size=1, bias=bias),
            nn.BatchNorm2d(internal_channels), activation())

        # Transposed convolution
        self.ext_tconv1 = nn.ConvTranspose2d(
            internal_channels,
            internal_channels,
            kernel_size=2,
            stride=2,
            bias=bias)
        self.ext_tconv1_bnorm = nn.BatchNorm2d(internal_channels)
        self.ext_tconv1_activation = activation()

        # 1x1 expansion convolution
        self.ext_conv2 = nn.Sequential(
            nn.Conv2d(
                internal_channels, out_channels, kernel_size=1, bias=bias),
            nn.BatchNorm2d(out_channels))

        self.ext_regul = nn.Dropout2d(p=dropout_prob)

        # PReLU/ReLU layer to apply after adding the branches
        self.out_activation = activation()

    def forward(self, x, max_indices, output_size):
        # Main branch shortcut
        main = self.main_conv1(x)
        main = self.main_unpool1(
            main, max_indices, output_size=output_size)

        # Extension branch
        ext = self.ext_conv1(x)
        ext = self.ext_tconv1(ext, output_size=output_size)
        ext = self.ext_tconv1_bnorm(ext)
        ext = self.ext_tconv1_activation(ext)
        ext = self.ext_conv2(ext)
        ext = self.ext_regul(ext)

        # Add main and extension branches
        out = main + ext
        return self.out_activation(out)
```
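The pooling-indices mechanism that `UpsamplingBottleneck` relies on can be illustrated without PyTorch: max pooling records the flat position of each maximum, and unpooling scatters the pooled values back to those positions, leaving zeros elsewhere. A minimal 1D sketch (the helper names `max_pool_with_indices` and `max_unpool` are hypothetical, not part of ENet):

```
def max_pool_with_indices(row, k=2):
    """1D max pooling with stride k; returns the pooled values and the
    flat index of each maximum, analogous to
    nn.MaxPool2d(..., return_indices=True)."""
    pooled, indices = [], []
    for start in range(0, len(row), k):
        window = row[start:start + k]
        offset = max(range(len(window)), key=window.__getitem__)
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices

def max_unpool(pooled, indices, output_size):
    """Scatter pooled values back to their recorded positions, zero
    elsewhere, analogous to nn.MaxUnpool2d."""
    out = [0] * output_size
    for value, index in zip(pooled, indices):
        out[index] = value
    return out

row = [1, 3, 2, 0, 5, 4]
pooled, indices = max_pool_with_indices(row)
restored = max_unpool(pooled, indices, len(row))
print(pooled, indices, restored)
# pooled [3, 2, 5], indices [1, 2, 4], restored [0, 3, 2, 0, 5, 0]
```

This is why the decoder needs no learned upsampling path in the main branch: the spatial locations of the maxima are remembered from the encoder, which is cheaper than storing full feature maps as in U-Net style skip connections.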

### ENet

The full ENet model is defined as follows.

Parameters:

num_classes: number of classes to segment

encoder_relu: if True, ReLU is used as the activation in the encoder; otherwise PReLU

decoder_relu: same as encoder_relu, but for the decoder

```
class ENet(nn.Module):
    """Generate the ENet model.

    Keyword arguments:
    - num_classes (int): the number of classes to segment.
    - encoder_relu (bool, optional): When ``True`` ReLU is used as the
      activation function in the encoder blocks/layers; otherwise, PReLU
      is used. Default: False.
    - decoder_relu (bool, optional): When ``True`` ReLU is used as the
      activation function in the decoder blocks/layers; otherwise, PReLU
      is used. Default: True.
    """

    def __init__(self, num_classes, encoder_relu=False, decoder_relu=True):
        super().__init__()

        self.initial_block = InitialBlock(3, 16, relu=encoder_relu)

        # Stage 1 - Encoder
        self.downsample1_0 = DownsamplingBottleneck(
            16,
            64,
            return_indices=True,
            dropout_prob=0.01,
            relu=encoder_relu)
        self.regular1_1 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_2 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_3 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_4 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)

        # Stage 2 - Encoder
        self.downsample2_0 = DownsamplingBottleneck(
            64,
            128,
            return_indices=True,
            dropout_prob=0.1,
            relu=encoder_relu)
        self.regular2_1 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated2_2 = RegularBottleneck(
            128, dilation=2, padding=2, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric2_3 = RegularBottleneck(
            128,
            kernel_size=5,
            padding=2,
            asymmetric=True,
            dropout_prob=0.1,
            relu=encoder_relu)
        self.dilated2_4 = RegularBottleneck(
            128, dilation=4, padding=4, dropout_prob=0.1, relu=encoder_relu)
        self.regular2_5 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated2_6 = RegularBottleneck(
            128, dilation=8, padding=8, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric2_7 = RegularBottleneck(
            128,
            kernel_size=5,
            asymmetric=True,
            padding=2,
            dropout_prob=0.1,
            relu=encoder_relu)
        self.dilated2_8 = RegularBottleneck(
            128, dilation=16, padding=16, dropout_prob=0.1, relu=encoder_relu)

        # Stage 3 - Encoder
        self.regular3_0 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated3_1 = RegularBottleneck(
            128, dilation=2, padding=2, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric3_2 = RegularBottleneck(
            128,
            kernel_size=5,
            padding=2,
            asymmetric=True,
            dropout_prob=0.1,
            relu=encoder_relu)
        self.dilated3_3 = RegularBottleneck(
            128, dilation=4, padding=4, dropout_prob=0.1, relu=encoder_relu)
        self.regular3_4 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated3_5 = RegularBottleneck(
            128, dilation=8, padding=8, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric3_6 = RegularBottleneck(
            128,
            kernel_size=5,
            asymmetric=True,
            padding=2,
            dropout_prob=0.1,
            relu=encoder_relu)
        self.dilated3_7 = RegularBottleneck(
            128, dilation=16, padding=16, dropout_prob=0.1, relu=encoder_relu)

        # Stage 4 - Decoder
        self.upsample4_0 = UpsamplingBottleneck(
            128, 64, dropout_prob=0.1, relu=decoder_relu)
        self.regular4_1 = RegularBottleneck(
            64, padding=1, dropout_prob=0.1, relu=decoder_relu)
        self.regular4_2 = RegularBottleneck(
            64, padding=1, dropout_prob=0.1, relu=decoder_relu)

        # Stage 5 - Decoder
        self.upsample5_0 = UpsamplingBottleneck(
            64, 16, dropout_prob=0.1, relu=decoder_relu)
        self.regular5_1 = RegularBottleneck(
            16, padding=1, dropout_prob=0.1, relu=decoder_relu)
        self.transposed_conv = nn.ConvTranspose2d(
            16,
            num_classes,
            kernel_size=3,
            stride=2,
            padding=1,
            bias=False)

    def forward(self, x):
        # Initial block
        input_size = x.size()
        x = self.initial_block(x)

        # Stage 1 - Encoder
        stage1_input_size = x.size()
        x, max_indices1_0 = self.downsample1_0(x)
        x = self.regular1_1(x)
        x = self.regular1_2(x)
        x = self.regular1_3(x)
        x = self.regular1_4(x)

        # Stage 2 - Encoder
        stage2_input_size = x.size()
        x, max_indices2_0 = self.downsample2_0(x)
        x = self.regular2_1(x)
        x = self.dilated2_2(x)
        x = self.asymmetric2_3(x)
        x = self.dilated2_4(x)
        x = self.regular2_5(x)
        x = self.dilated2_6(x)
        x = self.asymmetric2_7(x)
        x = self.dilated2_8(x)

        # Stage 3 - Encoder
        x = self.regular3_0(x)
        x = self.dilated3_1(x)
        x = self.asymmetric3_2(x)
        x = self.dilated3_3(x)
        x = self.regular3_4(x)
        x = self.dilated3_5(x)
        x = self.asymmetric3_6(x)
        x = self.dilated3_7(x)

        # Stage 4 - Decoder
        x = self.upsample4_0(x, max_indices2_0, output_size=stage2_input_size)
        x = self.regular4_1(x)
        x = self.regular4_2(x)

        # Stage 5 - Decoder
        x = self.upsample5_0(x, max_indices1_0, output_size=stage1_input_size)
        x = self.regular5_1(x)
        x = self.transposed_conv(x, output_size=input_size)
        return x
```
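The encoder–decoder symmetry above can be checked with simple arithmetic: the initial block and the two downsampling bottlenecks each halve the spatial resolution, stage 3 uses dilated convolutions without resampling, and the two upsampling bottlenecks plus the final transposed convolution restore the input resolution. A small sketch, assuming a square 512×512 input (the helper `trace_resolutions` is hypothetical, pure arithmetic only):

```
def trace_resolutions(size):
    """Trace the feature-map side length through ENet's stages."""
    stages = [("input", size)]
    size //= 2; stages.append(("initial_block", size))           # stride-2 initial block
    size //= 2; stages.append(("stage1 (downsample1_0)", size))  # encoder downsampling
    size //= 2; stages.append(("stage2 (downsample2_0)", size))  # encoder downsampling
    stages.append(("stage3 (dilated, no resampling)", size))     # resolution unchanged
    size *= 2; stages.append(("stage4 (upsample4_0)", size))     # decoder unpooling
    size *= 2; stages.append(("stage5 (upsample5_0)", size))     # decoder unpooling
    size *= 2; stages.append(("transposed_conv", size))          # back to full resolution
    return stages

for name, s in trace_resolutions(512):
    print(f"{name}: {s}x{s}")
```

This is also why `forward` must record `input_size`, `stage1_input_size`, and `stage2_input_size`: each upsampling step needs the exact target size, since integer halving alone would be ambiguous for odd input dimensions.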

## Commentary