ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18× faster, requires 75× fewer FLOPs, has 79× fewer parameters, and provides accuracy similar to or better than that of existing models. We have tested it on the CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and on the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.
2. Proposes ENet (efficient neural network);
4. Validates the model on the CamVid, Cityscapes and SUN RGB-D datasets.
1 Introduction
Recent interest in augmented reality wearables, home-automation devices, and self-driving vehicles has created a strong need for semantic-segmentation (or visual scene-understanding) algorithms that can operate in real-time on low-power mobile devices. These algorithms label each and every pixel in the image with one of the object classes. In recent years, the availability of larger datasets and computationally-powerful machines has helped deep convolutional neural networks (CNNs) [1, 2, 3, 4] surpass the performance of many conventional computer vision algorithms [5, 6, 7]. Even though CNNs are increasingly successful at classification and categorization tasks, they provide coarse spatial results when applied to pixel-wise labeling of images. Therefore, they are often cascaded with other algorithms to refine the results, such as color-based segmentation [8] or conditional random fields [9], to name a few.
In order to both spatially classify and finely segment images, several neural network architectures have been proposed, such as SegNet [10, 11] or fully convolutional networks [12]. All these works are based on a VGG16 [13] architecture, which is a very large model designed for multi-class classification. These references propose networks with huge numbers of parameters, and long inference times. In these conditions, they become unusable for many mobile or battery-powered applications, which require processing images at rates higher than 10 fps.
In this paper, we propose a new neural network architecture optimized for fast inference and high accuracy. Examples of images segmented using ENet are shown in Figure 1. In our work, we chose not to use any post-processing steps, which can of course be combined with our method, but would worsen the performance of an end-to-end CNN approach.
In Section 3 we propose a fast and compact encoder-decoder architecture named ENet. It has been designed according to rules and ideas that have appeared in the literature recently, all of which we discuss in Section 4. The proposed network has been evaluated on Cityscapes [14] and CamVid [15] for driving scenarios, whereas the SUN dataset [16] has been used for testing our network in an indoor situation. We benchmark it on an NVIDIA Jetson TX1 Embedded Systems Module as well as on an NVIDIA Titan X GPU. The results can be found in Section 5.
2 Related work
Semantic segmentation is important in understanding the content of images and finding target objects. This technique is of utmost importance in applications such as driving aids and augmented reality. Moreover, real-time operation is a must for them, and therefore, designing CNNs carefully is vital. Contemporary computer vision applications extensively use deep neural networks, which are now one of the most widely used techniques for many different tasks, including semantic segmentation. This work presents a new neural network architecture, and therefore we aim to compare to other literature that performs the large majority of inference in the same way.
State-of-the-art scene-parsing CNNs use two separate neural network architectures combined together: an encoder and a decoder. Inspired by probabilistic auto-encoders [17, 18], encoder-decoder network architecture has been introduced in SegNet-basic [10], and further improved in SegNet [11]. The encoder is a vanilla CNN (such as VGG16 [13]) which is trained to classify the input, while the decoder is used to upsample the output of the encoder [12, 19, 20, 21, 22]. However, these networks are slow during inference due to their large architectures and numerous parameters. Unlike in fully convolutional networks (FCN) [12], fully connected layers of VGG16 were discarded in the latest incarnation of SegNet, in order to reduce the number of floating point operations and memory footprint, making it the smallest of these networks. Still, none of them can operate in real-time.
Other existing architectures use simpler classifiers and then cascade them with a Conditional Random Field (CRF) as a post-processing step [9, 23]. As shown in [11], these techniques use onerous post-processing steps and often fail to label the classes that occupy fewer pixels in a frame.
CNNs can also be combined with recurrent neural networks [20] to improve accuracy, but then they suffer from speed degradation. Also, one has to keep in mind that an RNN, used as a post-processing step, can be used in conjunction with any other technique, including the one presented in this work.
3 Network architecture (key section)
The architecture of our network is presented in Table 1. It is divided into several stages, as highlighted by horizontal lines in the table and the first digit after each block name. Output sizes are reported for an example input image resolution of 512 × 512. We adopt a view of ResNets [24] that describes them as having a single main branch and extensions with convolutional filters that separate from it, and then merge back with an element-wise addition, as shown in Figure 2b. Each block consists of three convolutional layers: a 1 × 1 projection that reduces the dimensionality, a main convolutional layer (conv in Figure 2b), and a 1 × 1 expansion. We place Batch Normalization [25] and PReLU [26] between all convolutions. Just as in the original paper, we refer to these as bottleneck modules. If the bottleneck is downsampling, a max pooling layer is added to the main branch.
Also, the first 1 × 1 projection is replaced with a 2 × 2 convolution with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps. conv is either a regular, dilated or full convolution (also known as deconvolution or fractionally strided convolution) with 3 × 3 filters. Sometimes we replace it with asymmetric convolution i.e. a sequence of 5 × 1 and 1 × 5 convolutions. For the regularizer, we use Spatial Dropout [27], with p = 0.01 before bottleneck2.0, and p = 0.1 afterwards.
The initial stage contains a single block, that is presented in Figure 2a. Stage 1 consists of 5 bottleneck blocks, while stage 2 and 3 have the same structure, with the exception that stage 3 does not downsample the input at the beginning (we omit the 0th bottleneck). These three first stages are the encoder. Stage 4 and 5 belong to the decoder.
We did not use bias terms in any of the projections, in order to reduce the number of kernel calls and overall memory operations, as cuDNN [29] uses separate kernels for convolution and bias addition. This choice didn’t have any impact on the accuracy. Between each convolutional layer and following non-linearity we use Batch Normalization [25]. In the decoder max pooling is replaced with max unpooling, and padding is replaced with spatial convolution without bias. We did not use pooling indices in the last upsampling module, because the initial block operated on the 3 channels of the input frame, while the final output has C feature maps (the number of object classes). Also, for performance reasons, we decided to place only a bare full convolution as the last module of the network, which alone takes up a sizeable portion of the decoder processing time.
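The resolutions quoted for the 512 × 512 example input can be traced by hand with the standard output-size formula for strided convolution and pooling. The following is a minimal sketch in plain Python, not part of the network code. The helper names `down`, `up` and `trace` are hypothetical, `num_classes=12` is an arbitrary placeholder for C, and the 3 × 3 kernel with padding 1 assumed for the initial block is an illustrative choice that halves the resolution exactly.

```python
def down(size, kernel, stride=2, padding=0):
    """Spatial output size of a strided convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def up(size, factor=2):
    """Spatial output size of max unpooling or a stride-2 full convolution."""
    return size * factor


def trace(size=512, num_classes=12):
    """Return (stage name, feature maps, spatial size) for each ENet stage."""
    stages = []
    size = down(size, kernel=3, padding=1)   # initial block (assumed 3x3, padding 1)
    stages.append(("initial", 16, size))
    size = down(size, kernel=2)              # first downsampling bottleneck
    stages.append(("stage 1", 64, size))
    size = down(size, kernel=2)              # second downsampling bottleneck
    stages.append(("stage 2", 128, size))
    stages.append(("stage 3", 128, size))    # stage 3 does not downsample
    size = up(size)                          # decoder: first upsampling bottleneck
    stages.append(("stage 4", 64, size))
    size = up(size)                          # decoder: second upsampling bottleneck
    stages.append(("stage 5", 16, size))
    size = up(size)                          # final bare full convolution
    stages.append(("fullconv", num_classes, size))
    return stages


for name, maps, size in trace():
    print(f"{name:9s} {maps:4d} x {size} x {size}")
```

Under these assumptions the encoder works at 1/8 of the input resolution in stages 2 and 3, and the decoder restores the full resolution in three doubling steps.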
4 Design choices
In this section we will discuss our most important experimental results and intuitions, that have shaped the final architecture of ENet.
Feature map resolution Downsampling images during semantic segmentation has two main drawbacks. Firstly, reducing feature map resolution implies loss of spatial information like exact edge shape. Secondly, full pixel segmentation requires that the output has the same resolution as the input. This implies that strong downsampling will require equally strong upsampling, which increases model size and computational cost. The first issue has been addressed in FCN [12] by adding the feature maps produced by the encoder, and in SegNet [10] by saving indices of elements chosen in max pooling layers, and using them to produce sparse upsampled maps in the decoder. We followed the SegNet approach, because it allows us to reduce memory requirements. Still, we have found that strong downsampling hurts the accuracy, and tried to limit it as much as possible.
However, downsampling has one big advantage. Filters operating on downsampled images have a bigger receptive field, which allows them to gather more context. This is especially important when trying to differentiate between classes like, for example, rider and pedestrian in a road scene. It is not enough that the network learns how people look; the context in which they appear is equally important. In the end, we have found that it is better to use dilated convolutions for this purpose [30].
Early downsampling One crucial intuition to achieving good performance and real-time operation is realizing that processing large input frames is very expensive. This might sound very obvious, however many popular architectures do not pay much attention to optimization of early stages of the network, which are often the most expensive by far.
ENet's first two blocks heavily reduce the input size, and use only a small set of feature maps. The idea behind it is that visual information is highly spatially redundant, and thus can be compressed into a more efficient representation. Also, our intuition is that the initial network layers should not directly contribute to classification. Instead, they should rather act as good feature extractors and only preprocess the input for later portions of the network. This insight worked well in our experiments; increasing the number of feature maps from 16 to 32 did not improve accuracy on the Cityscapes [14] dataset.
Decoder size In this work we would like to provide a different view on encoder-decoder architectures than the one presented in [11]. SegNet is a very symmetric architecture, as the decoder is an exact mirror of the encoder. Instead, our architecture consists of a large encoder, and a small decoder.
This is motivated by the idea that the encoder should be able to work in a similar fashion to original classification architectures, i.e. to operate on smaller resolution data and provide for information processing and filtering. Instead, the role of the decoder is to upsample the output of the encoder, only fine-tuning the details.
The learned PReLU weights of the initial layers exhibit a large variance and are slightly biased towards positive values, while in the later portions of the encoder they settle to a recurring pattern. All layers in the main branch behave nearly exactly like regular ReLUs, while the weights inside bottleneck modules are negative, i.e. the function inverts and scales down negative values. We hypothesize that identity did not work well in our architecture because of its limited depth. The reason why such lossy functions are learned might be that the original ResNets [31] are networks that can be hundreds of layers deep, while our network uses only a couple of layers, and it needs to quickly filter out information. It is notable that the decoder weights become much more positive and learn functions closer to identity. This confirms our intuition that the decoder is used only to fine-tune the upsampled output.
Information-preserving dimensionality changes As stated earlier, it is necessary to downsample the input early, but aggressive dimensionality reduction can also hinder the information flow. A very good approach to this problem has been presented in [28]. It has been argued that a method used by the VGG architectures, i.e. performing a pooling followed by a convolution expanding the dimensionality, although relatively cheap, introduces a representational bottleneck (or forces one to use a greater number of filters, which lowers computational efficiency). On the other hand, pooling after a convolution that increases feature map depth is computationally expensive. Therefore, as proposed in [28], we chose to perform the pooling operation in parallel with a convolution of stride 2, and concatenate the resulting feature maps. This technique allowed us to speed up inference time of the initial block 10 times.
Additionally, we have found one problem in the original ResNet architecture. When downsampling, the first 1×1 projection of the convolutional branch is performed with a stride of 2 in both dimensions, which effectively discards 75% of the input. Increasing the filter size to 2 × 2 allows taking the full input into consideration, and thus improves the information flow and accuracy. Of course, it makes these layers 4× more computationally expensive, however there are so few of these in ENet that the overhead is unnoticeable.
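The 75% figure follows from counting which input positions a strided filter ever touches. Here is a small sketch in plain Python; the function name `covered_fraction` and the 8 × 8 map size are illustrative assumptions, and no padding is applied.

```python
def covered_fraction(size, kernel, stride):
    """Fraction of positions in a size x size map read by at least one filter window."""
    touched = set()
    for top in range(0, size - kernel + 1, stride):
        for left in range(0, size - kernel + 1, stride):
            for dy in range(kernel):
                for dx in range(kernel):
                    touched.add((top + dy, left + dx))
    return len(touched) / (size * size)


# 1x1 filter with stride 2: only every other row and column is read.
print(covered_fraction(8, kernel=1, stride=2))
# 2x2 filter with stride 2: the windows tile the whole map.
print(covered_fraction(8, kernel=2, stride=2))
```

The 2 × 2 filter has four times as many weights per channel pair as the 1 × 1 filter, which is where the 4× cost mentioned above comes from.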
Factorizing filters It has been shown that convolutional weights have a fair amount of redundancy, and each n × n convolution can be decomposed into two smaller ones following each other: one with an n × 1 filter and the other with a 1 × n filter [32]. This idea has also been presented in [28], and from now on we adopt their naming convention and will refer to these as asymmetric convolutions.
We have used asymmetric convolutions with n = 5 in our network, so the cost of these two operations is similar to a single 3 × 3 convolution. This allowed us to increase the variety of functions learned by blocks and increase the receptive field.
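The cost comparison can be checked by counting weights. The sketch below assumes bias-free convolutions and uses 32 input and output channels purely as an illustrative value; `conv_weights` is a hypothetical helper.

```python
def conv_weights(in_channels, out_channels, kernel_h, kernel_w):
    """Number of weights in a bias-free 2-D convolution."""
    return in_channels * out_channels * kernel_h * kernel_w


channels = 32  # illustrative channel count

regular_3x3 = conv_weights(channels, channels, 3, 3)
full_5x5 = conv_weights(channels, channels, 5, 5)
# Asymmetric pair: a 5x1 convolution followed by a 1x5 convolution.
asymmetric_5 = (conv_weights(channels, channels, 5, 1)
                + conv_weights(channels, channels, 1, 5))

print("3x3 regular:      ", regular_3x3)
print("5x5 regular:      ", full_5x5)
print("5x1 + 1x5 (asym.):", asymmetric_5)
```

Per pair of channels this is 10 weights for the asymmetric pair against 9 for a 3 × 3 filter and 25 for a full 5 × 5 filter, so the asymmetric pair covers a 5 × 5 window at close to the 3 × 3 price.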
What’s more, a sequence of operations used in the bottleneck module (projection, convolution, projection) can be seen as decomposing one large convolutional layer into a series of smaller and simpler operations that are its low-rank approximation. Such factorization allows for large speedups, and greatly reduces the number of parameters, making them less redundant [32]. Additionally, it makes the functions they compute richer, thanks to the non-linear operations that are inserted between layers.
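To give a feel for the parameter reduction, the following sketch counts the weights of a projection, convolution, expansion sequence against a single large convolution. It assumes bias-free layers, ignores Batch Normalization and PReLU parameters, and uses 128 channels with a reduction ratio of 4 as illustrative values; the function names are hypothetical.

```python
def bottleneck_weights(channels, internal_ratio=4, kernel=3):
    """Weights of a 1x1 projection, a kernel x kernel conv and a 1x1 expansion."""
    internal = channels // internal_ratio
    projection = channels * internal            # 1x1, channels -> internal
    main_conv = internal * internal * kernel * kernel
    expansion = internal * channels             # 1x1, internal -> channels
    return projection + main_conv + expansion


def single_conv_weights(channels, kernel=3):
    """Weights of one kernel x kernel convolution at full width."""
    return channels * channels * kernel * kernel


factored = bottleneck_weights(128)
unfactored = single_conv_weights(128)
print(factored, unfactored, round(unfactored / factored, 2))
```

With these values the factored form needs roughly 8.5 times fewer weights than the single full-width 3 × 3 convolution.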
Dilated convolutions As argued above, it is very important for the network to have a wide receptive field, so it can perform classification by taking a wider context into account. We wanted to avoid overly downsampling the feature maps, and decided to use dilated convolutions [30] to improve our model. They replaced the main convolutional layers inside several bottleneck modules in the stages that operate on the smallest resolutions. These gave a significant accuracy boost, by raising IoU on Cityscapes by around 4 percentage points, with no additional cost. We obtained the best accuracy when we interleaved them with other bottleneck modules (both regular and asymmetric), instead of arranging them in sequence, as has been done in [30].
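The effect of dilation on the receptive field can be estimated with simple arithmetic. The sketch below is a rough illustration in plain Python: it counts only the main convolution of each bottleneck (1 × 1 layers do not widen the field), treats each asymmetric 5 × 1 and 1 × 5 pair as a 5 × 5 window, and uses the interleaved ordering of regular, dilated (2, 4, 8, 16) and asymmetric modules found in the stage 2 definition of the implementation. The function names are hypothetical.

```python
def effective_kernel(kernel, dilation):
    """Window size covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of stacked stride-1 layers given as (kernel, dilation)."""
    field = 1
    for kernel, dilation in layers:
        field += effective_kernel(kernel, dilation) - 1
    return field


# regular, dilated 2, asymmetric 5, dilated 4, regular, dilated 8, asymmetric 5, dilated 16
interleaved = [(3, 1), (3, 2), (5, 1), (3, 4), (3, 1), (3, 8), (5, 1), (3, 16)]
plain = [(3, 1)] * 8  # eight regular 3x3 convolutions for comparison

print("interleaved:", receptive_field(interleaved))
print("plain 3x3:  ", receptive_field(plain))
```

Both stacks use 3 × 3 kernels in their dilated and regular layers, so the wider field comes at no extra weight cost; the sizes are measured in feature-map pixels at the downsampled resolution.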
Regularization Most pixel-wise segmentation datasets are relatively small (on the order of 10^3 images), so such expressive models as neural networks quickly begin to overfit them. In initial experiments, we used L2 weight decay with little success. Then, inspired by [33], we have tried stochastic depth, which increased accuracy. However, it became apparent that dropping whole branches (i.e. setting their output to 0) is in fact a special case of applying Spatial Dropout [27], where either all of the channels, or none of them are ignored, instead of selecting a random subset. We placed Spatial Dropout at the end of convolutional branches, right before the addition, and it turned out to work much better than stochastic depth.
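The relationship between the two regularizers can be illustrated with a toy sketch in plain Python, using nested lists in place of tensors. The function names are hypothetical, and rescaling the surviving channels by 1 / (1 - p) is the usual inverted-dropout convention assumed here, not something stated in the text.

```python
import random


def spatial_dropout(channels, p, rng):
    """Zero each channel (a 2-D list) independently with probability p."""
    out = []
    for channel in channels:
        if rng.random() < p:
            out.append([[0.0 for _ in row] for row in channel])
        else:
            # Surviving channels are rescaled (assumed inverted-dropout convention).
            out.append([[value / (1.0 - p) for value in row] for row in channel])
    return out


def stochastic_depth(channels, p, rng):
    """Drop the whole branch output with probability p: all channels or none."""
    if rng.random() < p:
        return [[[0.0 for _ in row] for row in channel] for channel in channels]
    return [[list(row) for row in channel] for channel in channels]


rng = random.Random(0)
branch = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(8)]  # 8 channels of size 2x2
dropped = spatial_dropout(branch, p=0.5, rng=rng)
zeroed = sum(all(v == 0.0 for row in ch for v in row) for ch in dropped)
print(f"{zeroed} of {len(branch)} channels zeroed")
```

Spatial Dropout makes one random decision per channel, whereas stochastic depth makes a single decision for the entire branch, which is why the latter is the special case described above.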
import torch
import torch.nn as nn


class InitialBlock(nn.Module):
    """The initial block is composed of two branches:
    1. a main branch which performs a regular convolution with stride 2;
    2. an extension branch which performs max-pooling.

    Doing both operations in parallel and concatenating their results
    allows for efficient downsampling and expansion. The main branch
    outputs 13 feature maps while the extension branch outputs 3, for a
    total of 16 feature maps after concatenation.

    Keyword arguments:
    - in_channels (int): the number of input channels.
    - out_channels (int): the number of output channels.
    - kernel_size (int, optional): the kernel size of the filters used in
      the convolution layer. Default: 3.
    - padding (int, optional): zero-padding added to both sides of the
      input. Default: 1.
    - bias (bool, optional): adds a learnable bias to the output if
      ``True``. Default: False.
    - relu (bool, optional): when ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    # NOTE: the signature and the Conv2d arguments are truncated in the
    # source and are reconstructed here from the docstring. The source
    # docstring states a padding default of 0; 1 is assumed instead so that
    # the convolution output matches the max-pooling output.
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1,
                 bias=False, relu=True):
        super().__init__()

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - As stated above the number of output channels for
        # this branch is the total minus 3, since the remaining channels
        # come from the extension branch
        self.main_branch = nn.Conv2d(
            in_channels,
            out_channels - 3,
            kernel_size=kernel_size,
            stride=2,
            padding=padding,
            bias=bias)

        # Extension branch
        self.ext_branch = nn.MaxPool2d(3, stride=2, padding=1)

        # Initialize batch normalization to be used after concatenation
        self.batch_norm = nn.BatchNorm2d(out_channels)

        # Activation to apply after concatenating the branches
        self.out_activation = activation()

    def forward(self, x):
        main = self.main_branch(x)
        ext = self.ext_branch(x)

        # Concatenate branches along the channel dimension
        out = torch.cat((main, ext), 1)

        # Apply batch normalization
        out = self.batch_norm(out)

        return self.out_activation(out)
class RegularBottleneck(nn.Module):
    """Regular bottlenecks are the main building block of ENet.

    Main branch:
    1. Shortcut connection.

    Extension branch:
    1. 1x1 convolution which decreases the number of channels by
       ``internal_ratio``, also called a projection;
    2. regular, dilated or asymmetric convolution;
    3. 1x1 convolution which increases the number of channels back to
       ``channels``, also called an expansion;
    4. dropout as a regularizer.

    Keyword arguments:
    - channels (int): the number of input and output channels.
    - internal_ratio (int, optional): a scale factor applied to
      ``channels`` used to compute the number of channels after the
      projection, e.g. given ``channels`` equal to 128 and
      ``internal_ratio`` equal to 2 the number of channels after the
      projection is 64. Default: 4.
    - kernel_size (int, optional): the kernel size of the filters used in
      the convolution described in item 2 of the extension branch.
      Default: 3.
    - padding (int, optional): zero-padding added to both sides of the
      input. Default: 0.
    - dilation (int, optional): spacing between kernel elements for the
      convolution described in item 2 of the extension branch. Default: 1.
    - asymmetric (bool, optional): flags if the convolution described in
      item 2 of the extension branch is asymmetric or not. Default: False.
    - dropout_prob (float, optional): probability of an element to be
      zeroed. Default: 0 (no dropout).
    - bias (bool, optional): adds a learnable bias to the output if
      ``True``. Default: False.
    - relu (bool, optional): when ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    # NOTE: the signature and the nn.Conv2d calls are truncated in the
    # source; they are reconstructed from the docstring and the surviving
    # keyword arguments.
    def __init__(self, channels, internal_ratio=4, kernel_size=3, padding=0,
                 dilation=1, asymmetric=False, dropout_prob=0, bias=False,
                 relu=True):
        super().__init__()

        # Check that internal_ratio is within the expected range
        # (1, channels]
        if internal_ratio <= 1 or internal_ratio > channels:
            raise RuntimeError("Value out of range. Expected value in the "
                               "interval (1, {0}], got internal_ratio={1}."
                               .format(channels, internal_ratio))

        internal_channels = channels // internal_ratio

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - shortcut connection

        # Extension branch - 1x1 convolution, followed by a regular, dilated
        # or asymmetric convolution, followed by another 1x1 convolution,
        # and, finally, a regularizer (spatial dropout). Number of channels
        # is constant.

        # 1x1 projection convolution
        self.ext_conv1 = nn.Sequential(
            nn.Conv2d(channels, internal_channels, kernel_size=1, stride=1,
                      bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # If the convolution is asymmetric we split the main convolution in
        # two. E.g. for a 5x5 asymmetric convolution we have two
        # convolutions: the first is 5x1 and the second is 1x5.
        if asymmetric:
            self.ext_conv2 = nn.Sequential(
                nn.Conv2d(internal_channels, internal_channels,
                          kernel_size=(kernel_size, 1),
                          stride=1,
                          padding=(padding, 0),
                          dilation=dilation,
                          bias=bias), nn.BatchNorm2d(internal_channels), activation(),
                nn.Conv2d(internal_channels, internal_channels,
                          kernel_size=(1, kernel_size),
                          stride=1,
                          padding=(0, padding),
                          dilation=dilation,
                          bias=bias), nn.BatchNorm2d(internal_channels), activation())
        else:
            self.ext_conv2 = nn.Sequential(
                nn.Conv2d(internal_channels, internal_channels,
                          kernel_size=kernel_size, stride=1, padding=padding,
                          dilation=dilation,
                          bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # 1x1 expansion convolution
        self.ext_conv3 = nn.Sequential(
            nn.Conv2d(internal_channels, channels, kernel_size=1, stride=1,
                      bias=bias), nn.BatchNorm2d(channels), activation())

        self.ext_regul = nn.Dropout2d(p=dropout_prob)

        # Activation to apply after adding the branches
        self.out_activation = activation()

    def forward(self, x):
        # Main branch shortcut
        main = x

        # Extension branch
        ext = self.ext_conv1(x)    # reduce the number of channels
        ext = self.ext_conv2(ext)  # asymmetric convolution if asymmetric=True, otherwise a regular one
        ext = self.ext_conv3(ext)  # restore the number of channels
        ext = self.ext_regul(ext)  # spatial dropout

        # Add main and extension branches
        out = main + ext

        return self.out_activation(out)
in_channels: number of input channels
out_channels: number of output channels
internal_ratio: same as above
return_indices: whether to return the max-pooling indices
dropout_prob, bias, relu: same as above
class DownsamplingBottleneck(nn.Module):
    """Downsampling bottlenecks further downsample the feature map size.

    Main branch:
    1. max pooling with stride 2; indices are saved to be used for
       unpooling later.

    Extension branch:
    1. 2x2 convolution with stride 2 that decreases the number of channels
       by ``internal_ratio``, also called a projection;
    2. regular convolution (by default, 3x3);
    3. 1x1 convolution which increases the number of channels to
       ``out_channels``, also called an expansion;
    4. dropout as a regularizer.

    Keyword arguments:
    - in_channels (int): the number of input channels.
    - out_channels (int): the number of output channels.
    - internal_ratio (int, optional): a scale factor applied to
      ``in_channels`` used to compute the number of channels after the
      projection, e.g. given ``in_channels`` equal to 128 and
      ``internal_ratio`` equal to 2 the number of channels after the
      projection is 64. Default: 4.
    - return_indices (bool, optional): if ``True``, will return the max
      indices along with the outputs. Useful when unpooling later.
    - dropout_prob (float, optional): probability of an element to be
      zeroed. Default: 0 (no dropout).
    - bias (bool, optional): adds a learnable bias to the output if
      ``True``. Default: False.
    - relu (bool, optional): when ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    # NOTE: the signature and the layer arguments are truncated in the
    # source; they are reconstructed from the docstring. The default of
    # return_indices is not stated in the source and is assumed to be False.
    def __init__(self, in_channels, out_channels, internal_ratio=4,
                 return_indices=False, dropout_prob=0, bias=False, relu=True):
        super().__init__()

        # Store parameters that are needed later
        self.return_indices = return_indices

        # Check that internal_ratio is within the expected range
        # (1, in_channels]
        if internal_ratio <= 1 or internal_ratio > in_channels:
            raise RuntimeError("Value out of range. Expected value in the "
                               "interval (1, {0}], got internal_ratio={1}."
                               .format(in_channels, internal_ratio))

        internal_channels = in_channels // internal_ratio

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - max pooling followed by feature map (channels) padding
        self.main_max1 = nn.MaxPool2d(
            2, stride=2, return_indices=return_indices)

        # Extension branch - 2x2 convolution, followed by a regular
        # convolution, followed by another 1x1 convolution. The number of
        # channels changes from in_channels to out_channels.

        # 2x2 projection convolution with stride 2
        self.ext_conv1 = nn.Sequential(
            nn.Conv2d(in_channels, internal_channels, kernel_size=2, stride=2,
                      bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # Convolution
        self.ext_conv2 = nn.Sequential(
            nn.Conv2d(internal_channels, internal_channels, kernel_size=3,
                      stride=1, padding=1,
                      bias=bias), nn.BatchNorm2d(internal_channels), activation())

        # 1x1 expansion convolution
        self.ext_conv3 = nn.Sequential(
            nn.Conv2d(internal_channels, out_channels, kernel_size=1, stride=1,
                      bias=bias), nn.BatchNorm2d(out_channels), activation())

        self.ext_regul = nn.Dropout2d(p=dropout_prob)

        # Activation to apply after adding the branches
        self.out_activation = activation()

    def forward(self, x):
        # Main branch shortcut
        if self.return_indices:
            main, max_indices = self.main_max1(x)
        else:
            main = self.main_max1(x)
            max_indices = None  # keeps the return statement below valid

        # Extension branch
        ext = self.ext_conv1(x)
        ext = self.ext_conv2(ext)
        ext = self.ext_conv3(ext)
        ext = self.ext_regul(ext)

        # Main branch channel padding, created on the same device and with
        # the same dtype as the main branch
        n, ch_ext, h, w = ext.size()
        ch_main = main.size()[1]
        padding = torch.zeros(n, ch_ext - ch_main, h, w,
                              dtype=main.dtype, device=main.device)

        # Concatenate
        main = torch.cat((main, padding), 1)

        # Add main and extension branches
        out = main + ext

        return self.out_activation(out), max_indices
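Bottleneck (b) is defined below (its main path contains a 1x1 conv and max unpooling). This block performs the upsampling. The main branch consists of a 1x1 conv and max unpooling, where the max unpooling reuses the max-pooling indices from the downsampling bottleneck. The extension branch consists of 1x1 convs that change the number of channels, a 3x3 transposed convolution, and a dropout layer.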
class UpsamplingBottleneck(nn.Module):
    """The upsampling bottlenecks upsample the feature map resolution using
    max pooling indices stored from the corresponding downsampling
    bottleneck.

    Main branch:
    1. 1x1 convolution with stride 1 that changes the number of channels
       to ``out_channels``;
    2. max unpool layer using the max pool indices from the corresponding
       downsampling max pool layer.

    Extension branch:
    1. 1x1 convolution with stride 1 that decreases the number of channels
       by ``internal_ratio``, also called a projection;
    2. transposed convolution (by default, 3x3);
    3. 1x1 convolution which increases the number of channels to
       ``out_channels``, also called an expansion;
    4. dropout as a regularizer.

    Keyword arguments:
    - in_channels (int): the number of input channels.
    - out_channels (int): the number of output channels.
    - internal_ratio (int, optional): a scale factor applied to
      ``in_channels`` used to compute the number of channels after the
      projection, e.g. given ``in_channels`` equal to 128 and
      ``internal_ratio`` equal to 2 the number of channels after the
      projection is 64. Default: 4.
    - dropout_prob (float, optional): probability of an element to be
      zeroed. Default: 0 (no dropout).
    - bias (bool, optional): adds a learnable bias to the output if
      ``True``. Default: False.
    - relu (bool, optional): when ``True`` ReLU is used as the activation
      function; otherwise, PReLU is used. Default: True.
    """

    # NOTE: the signature, the closing of the nn.Sequential blocks and the
    # ConvTranspose2d arguments are truncated in the source. They are
    # reconstructed from the docstring; the stride, padding and the batch
    # normalization closing each 1x1 block are assumptions.
    def __init__(self, in_channels, out_channels, internal_ratio=4,
                 dropout_prob=0, bias=False, relu=True):
        super().__init__()

        # Check that internal_ratio is within the expected range
        # (1, in_channels]
        if internal_ratio <= 1 or internal_ratio > in_channels:
            raise RuntimeError("Value out of range. Expected value in the "
                               "interval (1, {0}], got internal_ratio={1}."
                               .format(in_channels, internal_ratio))

        internal_channels = in_channels // internal_ratio

        if relu:
            activation = nn.ReLU
        else:
            activation = nn.PReLU

        # Main branch - 1x1 convolution followed by max unpooling
        self.main_conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias),
            nn.BatchNorm2d(out_channels))

        # Remember that the stride is the same as the kernel_size, just like
        # the max pooling layers
        self.main_unpool1 = nn.MaxUnpool2d(kernel_size=2)

        # Extension branch - 1x1 convolution, followed by a transposed
        # convolution, followed by another 1x1 convolution. The number of
        # channels changes from in_channels to out_channels.

        # 1x1 projection convolution with stride 1
        self.ext_conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels, internal_channels, kernel_size=1, bias=bias),
            nn.BatchNorm2d(internal_channels), activation())

        # Transposed convolution (3x3 as stated in the docstring; the exact
        # output size is resolved by ``output_size`` in forward)
        self.ext_tconv1 = nn.ConvTranspose2d(
            internal_channels, internal_channels, kernel_size=3, stride=2,
            padding=1, bias=bias)
        self.ext_tconv1_bnorm = nn.BatchNorm2d(internal_channels)
        self.ext_tconv1_activation = activation()

        # 1x1 expansion convolution
        self.ext_conv2 = nn.Sequential(
            nn.Conv2d(
                internal_channels, out_channels, kernel_size=1, bias=bias),
            nn.BatchNorm2d(out_channels))

        self.ext_regul = nn.Dropout2d(p=dropout_prob)

        # Activation to apply after adding the branches
        self.out_activation = activation()

    def forward(self, x, max_indices, output_size):
        # Main branch shortcut
        main = self.main_conv1(x)
        main = self.main_unpool1(
            main, max_indices, output_size=output_size)

        # Extension branch
        ext = self.ext_conv1(x)
        ext = self.ext_tconv1(ext, output_size=output_size)
        ext = self.ext_tconv1_bnorm(ext)
        ext = self.ext_tconv1_activation(ext)
        ext = self.ext_conv2(ext)
        ext = self.ext_regul(ext)

        # Add main and extension branches
        out = main + ext

        return self.out_activation(out)
num_classes: number of classes
encoder_relu: if True, ReLU is used in the encoder; otherwise PReLU is used
decoder_relu: same as encoder_relu, but for the decoder
class ENet(nn.Module):
    """Generate the ENet model.

    Keyword arguments:
    - num_classes (int): the number of classes to segment.
    - encoder_relu (bool, optional): when ``True`` ReLU is used as the
      activation function in the encoder blocks/layers; otherwise, PReLU
      is used. Default: False.
    - decoder_relu (bool, optional): when ``True`` ReLU is used as the
      activation function in the decoder blocks/layers; otherwise, PReLU
      is used. Default: True.
    """

    # NOTE: the arguments of the downsampling bottlenecks, the asymmetric
    # bottlenecks and the final transposed convolution are truncated in the
    # source. They are reconstructed from the neighbouring layers and from
    # the text (asymmetric convolutions with n = 5, dropout p = 0.01 before
    # bottleneck2.0 and p = 0.1 afterwards, 3 x 3 full convolution).
    def __init__(self, num_classes, encoder_relu=False, decoder_relu=True):
        super().__init__()

        self.initial_block = InitialBlock(3, 16, relu=encoder_relu)

        # Stage 1 - Encoder
        self.downsample1_0 = DownsamplingBottleneck(
            16, 64, return_indices=True, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_1 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_2 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_3 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)
        self.regular1_4 = RegularBottleneck(
            64, padding=1, dropout_prob=0.01, relu=encoder_relu)

        # Stage 2 - Encoder
        self.downsample2_0 = DownsamplingBottleneck(
            64, 128, return_indices=True, dropout_prob=0.1, relu=encoder_relu)
        self.regular2_1 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated2_2 = RegularBottleneck(
            128, dilation=2, padding=2, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric2_3 = RegularBottleneck(
            128, kernel_size=5, padding=2, asymmetric=True,
            dropout_prob=0.1, relu=encoder_relu)
        self.dilated2_4 = RegularBottleneck(
            128, dilation=4, padding=4, dropout_prob=0.1, relu=encoder_relu)
        self.regular2_5 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated2_6 = RegularBottleneck(
            128, dilation=8, padding=8, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric2_7 = RegularBottleneck(
            128, kernel_size=5, padding=2, asymmetric=True,
            dropout_prob=0.1, relu=encoder_relu)
        self.dilated2_8 = RegularBottleneck(
            128, dilation=16, padding=16, dropout_prob=0.1, relu=encoder_relu)

        # Stage 3 - Encoder
        self.regular3_0 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated3_1 = RegularBottleneck(
            128, dilation=2, padding=2, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric3_2 = RegularBottleneck(
            128, kernel_size=5, padding=2, asymmetric=True,
            dropout_prob=0.1, relu=encoder_relu)
        self.dilated3_3 = RegularBottleneck(
            128, dilation=4, padding=4, dropout_prob=0.1, relu=encoder_relu)
        self.regular3_4 = RegularBottleneck(
            128, padding=1, dropout_prob=0.1, relu=encoder_relu)
        self.dilated3_5 = RegularBottleneck(
            128, dilation=8, padding=8, dropout_prob=0.1, relu=encoder_relu)
        self.asymmetric3_6 = RegularBottleneck(
            128, kernel_size=5, padding=2, asymmetric=True,
            dropout_prob=0.1, relu=encoder_relu)
        self.dilated3_7 = RegularBottleneck(
            128, dilation=16, padding=16, dropout_prob=0.1, relu=encoder_relu)

        # Stage 4 - Decoder
        self.upsample4_0 = UpsamplingBottleneck(
            128, 64, dropout_prob=0.1, relu=decoder_relu)
        self.regular4_1 = RegularBottleneck(
            64, padding=1, dropout_prob=0.1, relu=decoder_relu)
        self.regular4_2 = RegularBottleneck(
            64, padding=1, dropout_prob=0.1, relu=decoder_relu)

        # Stage 5 - Decoder
        self.upsample5_0 = UpsamplingBottleneck(
            64, 16, dropout_prob=0.1, relu=decoder_relu)
        self.regular5_1 = RegularBottleneck(
            16, padding=1, dropout_prob=0.1, relu=decoder_relu)

        # Final bare full convolution producing one map per class
        self.transposed_conv = nn.ConvTranspose2d(
            16, num_classes, kernel_size=3, stride=2, padding=1, bias=False)

    def forward(self, x):
        # Initial block
        input_size = x.size()
        x = self.initial_block(x)

        # Stage 1 - Encoder
        stage1_input_size = x.size()
        x, max_indices1_0 = self.downsample1_0(x)
        x = self.regular1_1(x)
        x = self.regular1_2(x)
        x = self.regular1_3(x)
        x = self.regular1_4(x)

        # Stage 2 - Encoder
        stage2_input_size = x.size()
        x, max_indices2_0 = self.downsample2_0(x)
        x = self.regular2_1(x)
        x = self.dilated2_2(x)
        x = self.asymmetric2_3(x)
        x = self.dilated2_4(x)
        x = self.regular2_5(x)
        x = self.dilated2_6(x)
        x = self.asymmetric2_7(x)
        x = self.dilated2_8(x)

        # Stage 3 - Encoder
        x = self.regular3_0(x)
        x = self.dilated3_1(x)
        x = self.asymmetric3_2(x)
        x = self.dilated3_3(x)
        x = self.regular3_4(x)
        x = self.dilated3_5(x)
        x = self.asymmetric3_6(x)
        x = self.dilated3_7(x)

        # Stage 4 - Decoder
        x = self.upsample4_0(x, max_indices2_0, output_size=stage2_input_size)
        x = self.regular4_1(x)
        x = self.regular4_2(x)

        # Stage 5 - Decoder
        x = self.upsample5_0(x, max_indices1_0, output_size=stage1_input_size)
        x = self.regular5_1(x)
        x = self.transposed_conv(x, output_size=input_size)

        return x