The authors introduced two new blocks, which utilize factorized convolutions. The first
(second in total) is the equivalent of block A, which we introduced previously. It is shown
in the following image:
Inception block B. When n=3, it is equivalent to block A
The second (third in total) block is similar, but the asymmetrical convolutions are parallel,
resulting in a higher output depth (more concatenated paths). The hypothesis here is that
the more features (different filters) the network has, the faster it learns (we also discussed
the need for more filters in Chapter 4, Computer Vision with Convolutional Networks). On the
other hand, the wider layers take more memory and computation time. As a compromise,
this block is only used in the deeper part of the network, after the other blocks:
Inception block C
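To make the factorization idea more concrete, the following is a minimal Keras sketch of a single factorized path (not the authors' exact block); the input shape, channel counts, and the choice of a 7 x 7 factorization are illustrative assumptions:

from keras.layers import Conv2D, Input
from keras.models import Model

# illustrative input volume: 17 x 17 spatial size, 192 channels (assumed values)
inputs = Input(shape=(17, 17, 192))

# dimensionality-reduction 1 x 1 convolution, as in the inception blocks
x = Conv2D(128, kernel_size=(1, 1), padding='same', activation='relu')(inputs)

# factorized 7 x 7 convolution: a 1 x 7 convolution followed by a 7 x 1 convolution
x = Conv2D(128, kernel_size=(1, 7), padding='same', activation='relu')(x)
x = Conv2D(128, kernel_size=(7, 1), padding='same', activation='relu')(x)

Model(inputs, x).summary()  # the two asymmetric layers use fewer weights than one 7 x 7 layer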
Using these new blocks, the authors proposed two new inception networks: v2 and v3.
Another major improvement in this version is the use of batch normalization, which was
introduced by the same authors. For more information about Inception v2 and v3, check
out the original paper, Rethinking the Inception Architecture for Computer Vision (https://arxiv.org/abs/1512.00567), by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe,
Jonathon Shlens, and Zbigniew Wojna, as well as Batch Normalization: Accelerating Deep
Network Training by Reducing Internal Covariate Shift (https://arxiv.org/abs/1502.03167),
by Sergey Ioffe and Christian Szegedy.
Inception v4 and Inception-ResNet
In the latest revision of inception networks, the authors introduce three new streamlined
inception blocks that build upon the idea of the previous versions. They introduce 7 x 7
asymmetric factorized convolutions, and average pooling instead of max pooling. More
importantly, they create a residual/inception hybrid network known as Inception-ResNet,
where the inception blocks also include residual connections. We can see the schematic of
one such block in the following diagram:
An inception block with residual skip connection
For more information about the new inception blocks and the network architectures, check
out the original paper, Inception-v4, Inception-ResNet, and the Impact of Residual Connections on
Learning (https://arxiv.org/abs/1602.07261), by Christian Szegedy, Sergey Ioffe,
Vincent Vanhoucke, and Alex Alemi.
Xception and MobileNets
The last inception network we'll discuss is Xception (from Extreme Inception). To
understand its hypothesis, let's recall that in Chapter 3, Deep Learning Fundamentals,
Computer Vision, and Convolutional Layers, we introduced standard and depthwise
convolutions. An output slice in standard convolution receives input from all input slices
using a single filter. The filter tries to learn features in a 3D space, where two of the
dimensions are spatial (the height and width of the slice) and the third is the channel.
Therefore, the filter maps both spatial and cross-channel correlations.
All inception blocks so far have started with a dimensionality-reduction 1 x 1 convolution.
From our new point of view, this connection maps cross-channel correlations, but not
spatial ones (because of the 1 x 1 filter size). On the other hand, the subsequent operations
in an inception block are standard convolutions, therefore mapping both types of
correlations. The author of Xception argues that, in fact, we can completely decouple cross-
channel and spatial correlations. We can do this with the so-called depthwise separable
convolutions. A depthwise separable convolution combines two operations: a depthwise
convolution and a 1 x 1 convolution. In a depthwise convolution, a single input slice
produces a single output slice, therefore it only maps spatial (and not cross-channel)
correlations. With 1 x 1 convolutions, we have the opposite. The following image represents
a depthwise separable convolution:
A depthwise separable convolution
Let's compare the standard and depthwise separable convolutions.
Imagine that we have 32 input and output channels, and a filter with a
size of 3 x 3. In a standard convolution, one output slice is the result of
applying one filter for each of the 32 input slices for a total of 32 x 3 x 3 =
288 weights (excluding bias). In a comparable depthwise convolution, the
filter has only 3 x 3 = 9 weights and the filter for the 1 x 1 convolution has
32 x 1 x 1 = 32 weights. The total number of weights is 32 + 9 = 41.
Therefore, the depthwise separable convolution is both faster and more
memory-efficient compared to the standard one.
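We can verify these counts with a short Keras sketch. The following compares a standard convolution with Keras' SeparableConv2D over an assumed 32-channel input; biases are disabled to match the calculation above:

from keras.layers import Conv2D, Input, SeparableConv2D
from keras.models import Model

inputs = Input(shape=(64, 64, 32))  # assumed 64 x 64 input with 32 channels

# standard convolution: 3 x 3 x 32 = 288 weights per output slice, 32 x 288 = 9216 in total
standard = Model(inputs, Conv2D(32, kernel_size=3, use_bias=False)(inputs))

# depthwise separable: 32 x 9 depthwise + 32 x 32 pointwise = 1312 weights in total
separable = Model(inputs, SeparableConv2D(32, kernel_size=3, use_bias=False)(inputs))

print(standard.count_params())   # 9216
print(separable.count_params())  # 1312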
We can think of the depthwise separable convolution as an extreme (hence the name)
version of an inception block, where each depthwise input/output slice pair represents one
parallel path. We have as many parallel paths as the number of input slices. One difference
with the other inception blocks is that the 1 x 1 convolution comes last, instead of first. But
these operations are meant to be stacked anyway, and we can assume that the order is of no
significance. Another difference is the absence of non-linear activation (ReLU or ELU)
between the two operations. According to the author's experiments, networks without a
non-linearity between the depthwise and 1 x 1 convolutions converged faster and were more accurate.
The Xception network is built entirely of depthwise separable convolutions and it also
includes residual connections. For more information, check out the original paper, Xception:
Deep Learning with Depthwise Separable Convolutions (https://arxiv.org/abs/1610.02357),
by François Chollet.
MobileNets are another class of models, built with depthwise separable convolutions.
These networks are lightweight and specifically optimized for mobile and embedded
applications. You can read more about them in the original paper, MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications (https://arxiv.org/abs/1704.04861), by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, as well as the new version,
MobileNetV2: Inverted Residuals and Linear Bottlenecks (https://arxiv.org/abs/1801.04381),
by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh
Chen.
DenseNets
DenseNet stands for Densely-Connected Convolutional Networks. It tries to alleviate the
vanishing gradient problem and improve feature propagation, while reducing the number
of network parameters. We've already seen how ResNets introduce residual blocks with
skip connections to solve this. DenseNets take some inspiration from this idea and
introduce dense blocks. A dense block consists of sequential convolutional layers, where
any layer has a direct connection to all subsequent layers:
A dense block: The dimensionality-reduction layers (dashed lines) are part of the DenseNet-B architecture, while the original DenseNet doesn't have them
Here are some properties of the dense block:
The different inputs are merged via concatenation, unlike ResNets, which use
sum.
A batch normalization and ReLU are applied over each concatenation, and then
the result is fed to the following convolutional layer.
A dense block is specified by its number of convolutional layers and the output
volume depth of each layer, which is called growth rate in this context. Let's
assume that the input of the dense block has a volume depth of k0 and the output
volume depth of each convolutional layer is k. Then, because of the
concatenation, the input volume depth for the l-th layer will be k0 + k x (l −
1). The authors also introduced a second type of dense net, DenseNet-B, which
applies a dimensionality-reduction 1 x 1 convolution after each concatenation.
Although the later layers of a dense block have a large input volume depth
(because of the many concatenations), DenseNets can work with growth rate
values as low as 12, which reduces the total number of parameters.
To make concatenation possible, dense blocks use padding in such a way that the
height and width of all output slices are the same throughout the block. The
network uses average pooling between the dense blocks for downsampling (see the
code sketch following this list).
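The following is a minimal Keras sketch of a dense block in the spirit of DenseNet-B; the number of layers, the growth rate of 12, and the input depth are illustrative assumptions rather than the published configuration:

from keras.layers import Activation, BatchNormalization, Conv2D, Input, concatenate
from keras.models import Model

def dense_block(x, num_layers=4, growth_rate=12):
    # each layer receives the concatenation of all previous outputs
    for _ in range(num_layers):
        # batch normalization and ReLU over the current concatenation
        y = BatchNormalization()(x)
        y = Activation('relu')(y)

        # DenseNet-B style dimensionality-reduction 1 x 1 convolution
        y = Conv2D(4 * growth_rate, kernel_size=1, padding='same')(y)
        y = BatchNormalization()(y)
        y = Activation('relu')(y)

        # 3 x 3 convolution producing growth_rate new feature maps
        y = Conv2D(growth_rate, kernel_size=3, padding='same')(y)

        # concatenate the new features with all previous ones
        x = concatenate([x, y])

    return x

inputs = Input(shape=(32, 32, 16))   # assumed input volume with k0 = 16 channels
outputs = dense_block(inputs)        # output depth: 16 + 4 * 12 = 64
Model(inputs, outputs).summary()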
For more information about DenseNets, check out the original paper, Densely Connected
Convolutional Networks (https://arxiv.org/abs/1608.06993) by Gao Huang, Zhuang Liu,
Laurens van der Maaten, and Kilian Q. Weinberger.
Capsule networks
Capsule networks were introduced by Geoffrey Hinton as a way to overcome some of the
limitations of standard CNNs. To understand the idea behind capsule networks, we need to
understand these limitations first.
Limitations of convolutional networks
Let's start with a quote from professor Hinton himself:
"The pooling operation used in convolutional neural networks is a big mistake and the fact
that it works so well is a disaster."
What he means is that the CNNs are translation-invariant. To understand this, let's
imagine a picture with a face, located in the right half of the picture. Translation invariance
means that a CNN is very good at telling us that the picture contains a face, but it cannot
tell us whether the face is in the left or right part of the image. The main culprit for this
behavior is the pooling layers. Every pooling layer introduces a little translation invariance.
For example, the max pooling routes forward the activation of only one of the input
neurons, but the subsequent layers don't have any knowledge of which neuron is routed.
By stacking multiple pooling layers, we gradually increase the receptive field size. But the
detected object can be anywhere in the new receptive field, because none of the pooling
layers relay such information. Therefore, we also increase the translation invariance. At
first, this might seem to be a good thing, because the final labels have to be translation-
invariant. But it poses a problem, as CNNs cannot identify the position of one object
relative to another. It would identify both images in the following diagram as a face, because they
both contain the ingredients of a face (a nose, mouth, and eyes), regardless of their relative
positions to one another.
This is also known as the "Picasso problem," as demonstrated in the following diagram:
A convolutional network would identify both of these images as a face
But that's not all. A CNN would be confused even if the face had a different orientation, for
example, if it was turned upside down. One way to overcome this is with data
augmentation (rotation) during training. But this only shows the limitations of the network.
We have to explicitly show it the object in different orientations and tell it that this is, in
fact, the same object.
So far, we've seen that a CNN discards the translation information (translation invariance)
and doesn't understand the orientation of an object. In computer vision, the combination of
translation and orientation is known as pose. The pose is enough to uniquely identify the
object's properties in the coordinate system. Let's use computer graphics to illustrate this. A
3D object, say a cube, is entirely defined by its pose and the edge length. The process of
transforming the representation of a 3D object into an image on the screen is called
rendering. Knowing just its pose and the edge length of the cube, we can render it from any
point of view we like. Therefore, if we can somehow train a network to understand these
properties, we won't have to feed it with multiple augmented versions of the same object. A
CNN cannot do that, because its internal data representation doesn't contain information
about the object's pose (only about its type). In contrast, capsule networks preserve
information for both the type and the pose of an object. Therefore, they can detect objects
that can transform to each other, which is known as equivariance. We can also think of this
as "reverse graphics," that is, a reconstruction of the object's properties by its rendered
image.
Capsules
To solve these problems, the authors of the paper propose a new type of network building
block, called a capsule, instead of the neuron. The output of a neuron is a scalar value. In
contrast, the output of a capsule is a vector (a list of values), which consists of the following:
The elements of the vector represent the pose and other properties of the object.
The length of the vector is in the (0, 1) range and represents the probability of
detecting the feature at that location. As a reminder, the length of a vector is
$\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}$, where $v_i$ are the vector elements.
Let's consider a capsule, which detects faces. If we start moving a face across the image, the
values of the capsule vector will change to reflect the change in the position. However, its
length will always stay the same, because the probability of the face doesn't change with
the location.
The capsules are organized in interconnected layers, just like a regular network. The
capsules in one layer serve as input to the capsules in the next. And as in a CNN, the
earlier layers detect basic features, and the deeper layers combine them in more abstract
and complex ones. But now the capsules also relay positional information, instead of just
detected objects. This allows the deeper capsules to analyze not only the presence of
features, but also their relationship. For example, a capsule layer may detect a mouth, face,
nose, and eyes. The subsequent capsule layer will be able to not only verify the presence of
these features, but also whether they have the correct spatial relationship. Only if both
conditions are true can the subsequent layer verify that a face is present. This is a high-level
overview of capsule networks. Now, let's see how exactly capsules work.
We can see the schematic of a capsule in the following diagram:
A capsule
Let's analyze it in the following bullets:
The capsule inputs are the output vectors, u1, u2, ... un, from the capsules of the
previous layer.
We multiply each vector, $u_i$, by its corresponding weight matrix, $W_{ij}$, to produce
prediction vectors, $\hat{u}_{j|i} = W_{ij} u_i$. The weight matrices, W, encode spatial and
other relationships between the lower-level features, coming from the capsules of
the previous layer, and the high-level ones in the current layer. For example,
imagine that the capsule in the current layer detects faces and the capsules from
the previous layer detect the mouth ($u_1$), eyes ($u_2$), and nose ($u_3$). Then,
$\hat{u}_{j|1} = W_{1j} u_1$ is the predicted position of the face, given the location of the
mouth. In the same way, $\hat{u}_{j|2} = W_{2j} u_2$ predicts the location of the face based
on the detected location of the eyes, and $\hat{u}_{j|3} = W_{3j} u_3$ predicts the location of
the face based on the location of the nose. If all three lower-level capsule vectors
agree on the same location, then the current capsule can be confident that a face
is indeed present. We only used location for this example, but the vectors could
encode other types of relationships between the features, such as scale and
orientation. The weights, W, are learned with backpropagation.
Next, we multiply the vectors by the scalar coupling coefficients, cij. These
coefficients are a separate set of parameters, apart from the weight matrices. They
exist between any two capsules and indicate which high-level capsules will
receive input from a lower-level capsule. But unlike weight matrices, which are
adjusted via backpropagation, coupling coefficients are computed on the fly
during the forward pass via a process called dynamic routing. We'll describe it in
the next section.
Then, we perform the sum of the weighted input vectors. This step is similar to
the weighted sum in neurons, but with vectors: $s_j = \sum_i c_{ij} \hat{u}_{j|i}$
Finally, we'll compute the output of the capsule, $v_j$, by squashing the vector, $s_j$. In
this context, squashing means transforming the vector in such a way that its
length comes in the (0, 1) range, without changing its direction. As mentioned,
the length of the capsule vector represents the probability of the detected feature
and squashing it in the (0, 1) range reflects that. To do this, the authors propose a
novel formula: $v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$
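The following is a small NumPy sketch of this squashing formula, applied to a single vector:

import numpy as np

def squash(s):
    # keep the direction of s, but map its length into the (0, 1) range
    norm = np.linalg.norm(s)
    return (norm ** 2 / (1 + norm ** 2)) * (s / norm)

s = np.array([3.0, 4.0])     # a vector with length 5
v = squash(s)
print(np.linalg.norm(v))     # ~0.96: same direction, length now below 1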
Dynamic routing
Let's describe the dynamic routing process to compute the coupling coefficients, cij. In the
following diagram, we have a lower capsule, I, that has to decide whether to send its
output to one of two higher-level capsules, J and K. The dark and light dots represent
prediction vectors, and , which J and K have already received from other lower-
level capsules. The arrows from the I capsule to the J and K capsules point to the and
prediction vectors from I to J and K:
Dynamic routing example. The grouped dots indicate lower-level capsules that agree with each other
The clustered prediction vectors (lighter dots) indicate lower-level capsules that agree with
each other with regards to the high-level feature. For example, if the K capsule describes a
face, then the clustered predictions would indicate lower-level features, such as mouth,
nose, and eyes. Conversely, the dispersed (darker) dots indicate disagreement. If the I
capsule predicts a vehicle tire, it would disagree with the clustered predictions in K.
However, if the clustered predictions in J represent features such as headlights, windshield,
or fenders, then the prediction of I would be in agreement with them. The lower-level
capsules have a way of determining whether they fall in the clustered or dispersed group of
each higher-level capsule. If they fall in the clustered group, they will increase the
corresponding coupling coefficient with that capsule and will route their vector in that
direction. Conversely, if they fall in the dispersed group, the coefficient will decrease.
Let's formalize this knowledge with a step-by-step algorithm, introduced by the authors:
1. For all i capsules in the l layer, and j capsules in the (l + 1) layer, we'll initialize
$b_{ij} \leftarrow 0$, where $b_{ij}$ is a temporary variable equivalent to $c_{ij}$. The vector
representation of all $b_{ij}$ is $b_i$. At the start of the algorithm, the i capsule has an
equal chance to route its output to any of the capsules of the (l + 1) layer.
2. Repeat for r iterations, where r is a parameter:
For all i capsules in the l layer: $c_i \leftarrow \text{softmax}(b_i)$. The sum of all
outgoing coupling coefficients, $c_i$, of a capsule amounts to 1 (they have
a probabilistic nature), hence the softmax.
For all j capsules in the (l + 1) layer: $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$. That is, we'll
compute all non-squashed output vectors of the (l + 1) layer.
For all j capsules in the (l + 1) layer, we'll compute the squashed
vectors: $v_j \leftarrow \text{squash}(s_j)$.
For all i capsules in the l layer, and j capsules in the (l + 1) layer:
$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$. Here, $\hat{u}_{j|i} \cdot v_j$ is the dot product of the prediction
vector of the low-level i capsule and the output vector of the high-level
j capsule. If the dot product is high, then the i capsule is in
agreement with the other low-level capsules, which route their output
to the j capsule, and the coupling coefficient increases.
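The following is a minimal NumPy sketch of this routing loop for a single pair of layers. The capsule counts and dimensions are illustrative assumptions, and the random u_hat array stands in for the prediction vectors that would normally come from the learned weight matrices:

import numpy as np

def squash(s, axis=-1):
    norm = np.linalg.norm(s, axis=axis, keepdims=True)
    return (norm ** 2 / (1 + norm ** 2)) * (s / (norm + 1e-8))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, r=3):
    # u_hat: prediction vectors with shape (num_lower, num_higher, dim)
    num_lower, num_higher, dim = u_hat.shape
    b = np.zeros((num_lower, num_higher))                 # step 1: b_ij = 0

    for _ in range(r):
        c = softmax(b, axis=1)                            # coupling coefficients sum to 1 per lower capsule
        s = (c[..., np.newaxis] * u_hat).sum(axis=0)      # weighted sums, shape (num_higher, dim)
        v = squash(s)                                     # squashed outputs of the higher capsules
        b += (u_hat * v[np.newaxis, ...]).sum(axis=-1)    # agreement: dot products u_hat . v

    return v

u_hat = np.random.randn(1152, 10, 16)   # e.g. PrimaryCaps -> DigitCaps predictions
print(dynamic_routing(u_hat).shape)     # (10, 16)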
The authors have recently released an updated dynamic routing algorithm using a
clustering technique called Expectation–Maximization. You can read more about it in the
original paper, Matrix Capsules with EM Routing (https://ai.google/research/pubs/pub46653),
by Geoffrey Hinton, Sara Sabour, and Nicholas Frosst.
Structure of the capsule network
In this section, we'll describe the structure of the capsule network, which the authors used
to classify the MNIST dataset. The input of the network is the 28 x 28 MNIST greyscale
images and the following are the steps:
1. We'll start with a single convolutional layer with 256 9 x 9 filters, stride 1, and
ReLU activation. The shape of the output volume is (256, 20, 20).
2. We have another convolutional layer with 256 9 x 9 filters and stride 2. The shape
of the output volume is (256, 6, 6).
3. Use the output of the layer as a foundation for the first capsule layer, called
PrimaryCaps. Take the (256, 6, 6) output volume and split it into 32 separate (8,
6, 6) blocks. That is, each of the 32 blocks contains eight 6 x 6 slices. Take one
activation value with the same coordinates from each slice and combine these
values in a vector. For example, we can take activation (3, 4) of slice 1, (3, 4) of
slice 2, and so on, and combine them in a vector of length 8. We'll have 36 of
these vectors per block. Then we'll "transform" each vector into a capsule, for a total of
32 x 36 = 1,152 capsules. The shape of the output volume of the PrimaryCaps layer
is (32, 8, 6, 6) (see the sketch after these steps).
4. The second capsule layer is called DigitCaps. It contains 10 capsules (one per
digit), whose output is a vector of length 16. The shape of the output
volume of the DigitCaps layer is (10, 16). During inference, we compute the
length of each DigitCaps capsule vector. We then take the capsule with the
longest vector as the prediction result of the network.
5. During training, the network includes three additional, fully-connected layers
after DigitCaps, the last of which has 784 neurons (28 x 28). In the forward
training pass, the longest capsule vector serves as input to these layers. They try
to reconstruct the original image, starting from that vector. Then, the
reconstructed image is compared to the original one and the difference serves as
additional regularization loss for the backward pass.
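To illustrate the reshaping in step 3, the following NumPy sketch turns a (256, 6, 6) volume into 32 x 36 = 1,152 capsule vectors of length 8 (a random array stands in for the real convolutional output):

import numpy as np

conv_out = np.random.rand(256, 6, 6)      # stand-in for the second convolutional layer's output

blocks = conv_out.reshape(32, 8, 6, 6)    # 32 blocks, each with eight 6 x 6 slices
capsules = blocks.transpose(0, 2, 3, 1)   # group the 8 values at each (row, col) position
capsules = capsules.reshape(-1, 8)        # 32 * 6 * 6 = 1152 capsule vectors of length 8

print(capsules.shape)                     # (1152, 8)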
Capsule networks are a new and promising approach to computer vision. However, they
are not widely adopted yet and don't have an official implementation in any of the deep
learning libraries discussed in this book, but you can find multiple third-party
implementations.
For more information about capsule networks, check out the original paper, Dynamic
Routing Between Capsules (https://arxiv.org/abs/1710.09829), by Sara Sabour, Nicholas
Frosst, and Geoffrey E Hinton.
Advanced computer vision tasks
So far, we've discussed classification tasks; a CNN can tell us what object is in the image
and give us a confidence score, but nothing more. In this section, we'll discuss two more advanced
and interesting tasks: object detection and semantic segmentation.
Object detection
Object detection is the process of finding object instances of a certain class, such as faces,
cars, and trees, in images or videos. Unlike classification, object detection can detect
multiple objects, as well as their location in the image.
An object detector would return a list of detected objects with the following information for
each object:
The class of the object (person, car, tree, and so on).
Probability (or confidence score) in the [0, 1] range, which conveys how confident
the detector is that the object exists in that location. This is similar to the output
of a regular classifier.
The coordinates of the rectangular region of the image where the object is
located. This rectangle is called a bounding box.
We can see the typical output of an object-detection algorithm in the following photograph.
The object type and confidence score are above each bounding box:
The output of an object detector. The vehicle on the left is wrongly classified as person, but the rest of the objects are classified correctly.
Approaches to object detection
In this section, we'll outline three approaches:
Classic sliding window: Here, we'll use a regular classification network
(classifier). This approach can work with any type of classification algorithm, but
it's relatively slow and error-prone:
1. Build an image pyramid. This is a combination of different scales of
the same image (see the following photograph). For example, each
scaled image can be two times smaller than the previous one. In this
way, we'll be able to detect objects regardless of their size in the
original image.
2. Slide the classifier across the whole image. That is, we'll use each
location of the image as an input to the classifier and the result will
determine what type of object is in that location. The bounding box of
that location is just the image region that we used as input.
3. We'll have multiple overlapping bounding boxes for each object. We'll
use some heuristics to combine them in a single prediction.
Here is an illustration of the sliding window approach:
Sliding window + image pyramid object detection
Two-stage detection methods: These methods are very accurate, but relatively
slow. As the name suggests, they involve two steps:
1. A special type of CNN, called a Region Proposal Network, scans the
image and proposes a number of possible bounding boxes where
objects might be located. However, this network doesn't detect the
type of the object, but only whether an object is present in the region.
2. The regions of interest are sent to the second stage for object
classification.
One-stage detection methods: Here, a single CNN produces both the object type
and the bounding box. These approaches are usually faster, but less accurate
compared to two-stage methods.
Object detection with YOLOv3
In this section, we'll discuss one of the most popular detection algorithms, called YOLO.
The name stands for You Only Look Once (a play on the popular motto "you only live once"), which reflects the one-
stage nature of the algorithm. The authors have released three versions with incremental
improvements of the algorithm. We'll first discuss the latest, v3.
Before diving deeper (pun intended), we should mention a few things about YOLO:
It works with a fully-convolutional network (without pooling layers), not unlike
the ones we've seen in this chapter. It uses residual connections and batch
normalization. The YOLOv3 network uses three different scales of the image for
prediction. What makes it different, though, is the use of special type of
groundtruth/output data, which is a combination of classification and regression.
The network takes the whole image as an input and outputs the bounding boxes,
object classes, and confidence scores of all detected objects in just a single pass.
For example, the bounding boxes in the image of people on the crosswalk at the
beginning of this section were generated using a single network pass.
With that introduction, let's see how YOLO works:
1. Split the image into a grid of S x S cells (in the following diagram, we can see a 3
x 3 grid):
The network treats the center of each grid cell as the center of the
region, where an object might be located.
An object might lie entirely within a cell. Then, its bounding box will
be smaller than the cell. Alternatively, it can span over multiple cells
and the bounding box will be larger. YOLO covers both cases.
YOLO can detect multiple objects in a grid cell with the help of anchor
boxes (more on that later), but an object is associated with one cell only
(1-to-n relation). That is, if the bounding box of the object covers
multiple cells, we'll associate the object with the cell, where the center
of the bounding box lies. For example, the two objects in the following
diagram span multiple cells, but they are both assigned to the central
cell, because their centers lie in it.
Some of the cells may contain an object and others might not. We are only
interested in the ones that do:
An object detection YOLO example with a 3 x 3 cell grid, 2 objects, and their bounding boxes (dashed lines). Both objects are associated with the middle cell, because the centers
of their bounding boxes lie in that cell
2. The network output and target data combine classification and regression. The network outputs
possible detected objects for each grid cell. For example, if the grid is 3 x 3, then
the output will contain nine possible detected objects. For the sake of clarity, let's
discuss the output data (and its corresponding label) for a single grid
cell/detected object. It is an array with values, [bx, by, bh, bw, pc, c1, c2,
..., cn], where:
bx, by, bh, bw describe the bounding box (if an object exists). bx and
by are the coordinates of the upper-left corner of the box. They are
normalized in the [0, 1] range with respect to the size of the image.
That is, if the image is of size 100 x 100 and bx = 20 and by = 50, their
normalized values would be 0.2 and 0.5. bh and bw represent the box
height and width. They are normalized with respect to the grid cell. If
the bounding box is larger than the cell, its value will be greater than 1.
Predicting the box parameters is a regression task.
pc is a confidence score in the [0, 1] range. The labels for the confidence
score are either 0 (not present) or 1 (present), making this part of the
output a classification task. If an object is not present, we can discard
the rest of the array values.
c1, c2, ..., cn is a one-hot encoding of the object class. For
example, if we have car, person, tree, cat, and dog classes, and the
current object is of the cat type, its encoding will be [0, 0, 0, 1,
0]. If we have n possible classes, the size of the output array for one
cell would be 5 + n (10 in our five-class example).
The network output/labels will contain S x S such arrays. For example, the length
of the YOLO output for a 3 x 3 cell grid and our five classes will be 3 x 3 x 10 = 90.
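As a small illustration of this encoding, the following sketch builds the target array for one grid cell containing a cat, using the five classes from the example above; the bounding box numbers are made up:

import numpy as np

classes = ['car', 'person', 'tree', 'cat', 'dog']

# made-up bounding box: normalized corner coordinates, height/width relative to the cell
bx, by, bh, bw = 0.2, 0.5, 1.4, 0.8
pc = 1.0                                    # an object is present in this cell

one_hot = np.zeros(len(classes))
one_hot[classes.index('cat')] = 1.0         # [0, 0, 0, 1, 0]

cell_target = np.concatenate([[bx, by, bh, bw, pc], one_hot])
print(len(cell_target))                     # 5 + n = 10 values for one cell

# full label tensor for a 3 x 3 grid: S x S x (5 + n) = 90 values
grid_target = np.zeros((3, 3, 5 + len(classes)))
grid_target[1, 1] = cell_target             # the object is assigned to the central cell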
3. Let's address the scenario with multiple objects in the same cell. Thankfully,
YOLO proposes an elegant solution to this problem. We'll have multiple
candidate boxes (known as anchor boxes or priors) with a slightly different shape
for each cell. In the following diagram, we can see the grid cell (square,
uninterrupted line) and two anchor boxes – vertical and horizontal (dashed
lines). If we have multiple objects in the same cell, we'll associate each object with
one of the anchor boxes. Conversely, if an anchor box doesn't have an associated
object, it will have a confidence score of 0. This arrangement will also change the
network output. We'll have multiple output arrays per grid cell (one output array
per anchor box). To extend our previous example, let's assume we have a 3 x 3
cell grid with five classes and two anchor boxes per cell. Then, we'll have 3 x 3 x
2 = 18 output bounding boxes and a total output length of 3 x 3 x 2 x 10 = 180.
Following is a figure of a grid cell with two anchor boxes:
Grid cell (square, uninterrupted line) with two anchor boxes (dashed lines)
The only question now is how to choose the proper anchor box for an object
during training (during inference the network will choose by itself). We'll do this
with the help of Intersection over Union (IoU). This is just the ratio between the
area of the intersection of the object bounding box/anchor box, and the area of
their union:
Intersection over Union (IoU)
We'll compare the bounding box of each object to all anchor boxes, and assign the
object to the anchor box with the highest IoU.
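The following is a small sketch of the IoU computation for two axis-aligned boxes, each given as (x1, y1, x2, y2) corner coordinates (a representation chosen here for simplicity):

def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # areas of the two boxes
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # union = sum of the areas minus the double-counted intersection
    return intersection / float(area_a + area_b - intersection)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # 4 / 28 = ~0.14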
4. Now that we (hopefully) know how YOLO works, we can use it for predictions.
However, the output of the network might be noisy – that is, the output includes
all possible anchor boxes for each cell, regardless of whether an object is present
in them. Many of these boxes will overlap and actually predict the same object.
We'll get rid of the noise using non-maximum suppression. Here's how it works:
1. Discard all bounding boxes with a confidence score <= 0.6.
2. From the remaining bounding boxes, pick the one with the highest
possible confidence score.
3. Discard any box whose IoU >= 0.5 with the box we selected in the
previous step.
If you are worried that the network output/groundtruth data will become
too complex or large, don't be. CNNs work well with the ImageNet
dataset, which has 1,000 categories, and therefore 1,000 outputs.
For more information about YOLO, check out the original sequence of papers:
You Only Look Once: Unified, Real-Time Object Detection (https://arxiv.org/abs/1506.02640) by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali
Farhadi
YOLO9000: Better, Faster, Stronger (https://arxiv.org/abs/1612.08242) by
Joseph Redmon and Ali Farhadi
YOLOv3: An Incremental Improvement (https://arxiv.org/abs/1804.02767)
by Joseph Redmon and Ali Farhadi
A code example of YOLOv3 with OpenCV
In this section, we'll demonstrate how to use the YOLOv3 object detector with OpenCV. For
this example, you'll need OpenCV 3.4.2 or higher, and 250 MB of disk space for the pre-
trained YOLO network. Let's begin with the following steps:
1. Start with the imports:
import os.path
import cv2 # opencv import
import numpy as np
import requests
2. Add some boilerplate code, which downloads and stores the following:
The YOLOv3 network configuration. We'll use the YOLO author's
GitHub and personal website to do this.
The names of the classes that the network can detect. We'll also load
them from the file.
A test image from Wikipedia. We'll also load the image from the file:
# Download YOLO net config file
# We'll download it from the YOLO author's GitHub repo
yolo_config = 'yolov3.cfg'
if not os.path.isfile(yolo_config):
    url = 'https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg'
    r = requests.get(url)
    with open(yolo_config, 'wb') as f:
        f.write(r.content)

# Download YOLO net weights
# We'll download them from the YOLO author's website
yolo_weights = 'yolov3.weights'
if not os.path.isfile(yolo_weights):
    url = 'https://pjreddie.com/media/files/yolov3.weights'
    r = requests.get(url)
    with open(yolo_weights, 'wb') as f:
        f.write(r.content)

# Download class names file
# Contains the names of the classes the network can detect
classes_file = 'coco.names'
if not os.path.isfile(classes_file):
    url = 'https://raw.githubusercontent.com/pjreddie/darknet/master/data/coco.names'
    r = requests.get(url)
    with open(classes_file, 'wb') as f:
        f.write(r.content)

# load class names
with open(classes_file, 'r') as f:
    classes = [line.strip() for line in f.readlines()]

# Download object detection image
image_file = 'source.jpg'
if not os.path.isfile(image_file):
    url = 'https://upload.wikimedia.org/wikipedia/commons/c/c7/Abbey_Road_Zebra_crossing_2004-01.jpg'
    r = requests.get(url)
    with open(image_file, 'wb') as f:
        f.write(r.content)

# read and normalize image
image = cv2.imread(image_file)
blob = cv2.dnn.blobFromImage(image, 1 / 255, (416, 416), (0, 0, 0), True, crop=False)
3. Initialize the network with the weights and config we just downloaded:
# Load the network
net = cv2.dnn.readNet(yolo_weights, yolo_config)
4. Feed the image to the network and do the inference:
# set as input to the net
net.setInput(blob)
# get network output layers
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
# inference
# the network outputs multiple lists of anchor boxes,
# one for each detected class
outs = net.forward(output_layers)
5. Iterate over the classes and anchor boxes and prepare them for the next step:
# extract bounding boxes
class_ids = list()
confidences = list()
boxes = list()

# iterate over all classes
for out in outs:
    # iterate over the anchor boxes for each class
    for detection in out:
        # bounding box
        center_x = int(detection[0] * image.shape[1])
        center_y = int(detection[1] * image.shape[0])
        w = int(detection[2] * image.shape[1])
        h = int(detection[3] * image.shape[0])
        x = center_x - w // 2
        y = center_y - h // 2
        boxes.append([x, y, w, h])

        # class
        class_id = np.argmax(detection[5:])
        class_ids.append(class_id)

        # confidence
        confidence = detection[4]
        confidences.append(float(confidence))
6. Remove the noise with non-max suppression. You can experiment with different
values of score_threshold and nms_threshold to see how the detected
objects change. For example, setting score_threshold=0.3 will detect more
cars in the distance:
# non-max suppression
ids = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold=0.3, nms_threshold=0.5)
7. Draw the bounding boxes on the image and display the result:
# draw the bounding boxes on the image
colors = np.random.uniform(0, 255, size=(len(classes), 3))

for i in ids:
    i = i[0]
    x, y, w, h = boxes[i]
    class_id = class_ids[i]
    color = colors[class_id]

    cv2.rectangle(image, (round(x), round(y)), (round(x + w), round(y + h)), color, 2)

    label = "%s: %.2f" % (classes[class_id], confidences[i])
    cv2.putText(image, label, (x - 10, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)

cv2.imshow("Object detection", image)
cv2.waitKey()
If everything goes alright, this code block will produce the same image that we saw at the
beginning of this section.
Semantic segmentation
Semantic segmentation is the process of assigning a class label (such as person, car, or tree)
to each pixel of the image. You can think of it as classification, but on a pixel level – instead
of classifying the entire image under one label, we'll classify each pixel separately. Here is
an example of semantic segmentation:
Semantic segmentation
To train a segmentation algorithm, we'll need a special type of groundtruth data, where the
labels for each image are the semantically segmented version of the image.
There are many approaches to semantic segmentation, which we can see in the
following bullets:
The easiest way to do this is to use the familiar sliding-window technique, which
we described in the Approaches to object detection section. That is, we'll use a
regular classifier and we'll slide it across the image in both directions with stride 1. After we get
the prediction for a location, we'll take the pixel that lies in the middle of the
input region and assign it the predicted class. Predictably, this
approach is very slow, due to the large number of pixels in an image (even a 1024
x 1024 image has more than 1,000,000 pixels).
We can use a special type of CNN, called Fully Convolutional Network (FCN),
to classify all pixels in the input region in a single pass. We can separate an FCN
into two virtual components (in reality, this is just a single network):
The encoder is the first part of the network. It is like a regular
CNN, without the fully-connected layers at the end. The role of the
encoder is to learn highly abstract representations of the input
image (nothing new here).
The decoder is the second part of the network. It starts after the
encoder and uses it as input. The role of the decoder is to
"translate" these abstract representations into the segmented
groundtruth data. To do this, the decoder uses the opposite of the
encoder operations. This includes unpooling (the opposite of
pooling) and deconvolutions (the opposite of convolutions). We'll
talk more about this concept (but in a different context) in Chapter
6, Generating Images with GANs and VAEs. A minimal sketch of the
encoder/decoder idea follows this list.
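The following is a minimal Keras sketch of the encoder/decoder idea (not a full FCN); the layer sizes, the number of classes, and the use of transposed convolutions for upsampling are illustrative assumptions:

from keras.layers import Conv2D, Conv2DTranspose, Input, MaxPooling2D
from keras.models import Model

num_classes = 10                      # assumed number of segmentation classes
inputs = Input(shape=(128, 128, 3))

# encoder: convolutions and downsampling, like a regular CNN without dense layers
x = Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = MaxPooling2D()(x)                 # 64 x 64
x = Conv2D(64, 3, padding='same', activation='relu')(x)
x = MaxPooling2D()(x)                 # 32 x 32

# decoder: transposed convolutions upsample back to the input resolution
x = Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)   # 64 x 64
x = Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)   # 128 x 128

# per-pixel class probabilities
outputs = Conv2D(num_classes, 1, activation='softmax')(x)

Model(inputs, outputs).summary()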
Artistic style transfer
Artistic style transfer is the use of the style (or texture) of one image to reproduce the
semantic content of another. It can be implemented with different algorithms, but the most
popular way was introduced in 2015 in the paper A Neural Algorithm of Artistic Style
(https://arxiv.org/abs/1508.06576) by Leon A. Gatys, Alexander S. Ecker, and Matthias
Bethge. It's also known as neural style transfer and it uses (you guessed it!) CNNs. The
basic algorithm has been improved and tweaked over the past few years, but in this section
we'll look at the way it was first introduced, because it will give us a good foundation for
understanding the latest versions.
The algorithm takes two images as input:
Content image (C) we would like to redraw
Style image (S) whose style (texture) we'll use to redraw C
The result of the algorithm is a new image: G = C + S. Here is an example of artistic style
transfer:
An example of neural style transfer
To understand how neural style transfer works, let's recall that CNNs learn a hierarchical
representation of the features. We know that the initial convolutional layers learn basic
features, such as edges and lines. Conversely, the deeper layers learn more complex
features, such as faces, cars, and trees. This is best visible in the diagrams in the What is deep
learning? section of Chapter 3, Deep Learning Fundamentals. Knowing this, let's start with the
following steps:
1. The authors propose we use a regular pre-trained VGG network. Next comes the
interesting part.
2. Feed the network with the content image, C. Extract and store the output
activations (or feature maps or slices) of one or more of the hidden layers in the
middle of the network. Let's denote these activations with $A_c^l$, where l is the
index of the layer. We're interested in middle layers, because the level of feature
abstraction encoded in them is best suited for the task.
3. Do the same with the style image, S. This time, denote the style activations of the
l layer with $A_s^l$. The layers we choose for the content and style are not necessarily
the same.
4. Generate a single random image (white noise), G. This random image will
gradually turn into the end result of the algorithm. We'll repeat for a number of
iterations:
1. Propagate G through the network. This is the only image we'll use
throughout the whole process. As before, we'll store the
activations for all the l layers (here, l is the combination of all layers we
used for the content and style images). Let's denote these activations
with $A_g^l$.
2. Compute the difference between the random noise activations, $A_g^l$, on
one hand, and $A_c^l$ and $A_s^l$ on the other. These will be the two
components of our loss function:
$J_{content}(C, G)$, known as content loss: This is
just the mean-square error over the element-wise difference
between the two activations of all l layers.
$J_{style}(S, G)$, known as style loss: It's similar to the content loss,
but instead of raw activations, we'll compare their Gram
matrices (we won't go into detail about that).
3. Use the content and style losses to compute the total loss,
$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$, which is just a weighted sum of the two.
The α and β coefficients determine which of the components will carry
more weight.
4. Backpropagate the gradients to the start of the network and update the
generated image, $G \leftarrow G - \lambda \frac{\partial J(G)}{\partial G}$. In this way, we make G more similar
to both the content and style images, since the loss function is a
combination of both.
This algorithm makes it possible to harness the representational power of
convolutional networks for artistic style transfer. It does this with a novel loss function and
a smart use of backpropagation.
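The following NumPy sketch shows the two ingredients of this loss: a mean-square content loss over the activations and a Gram matrix, whose mean-square difference is used for the style loss. The activation shapes and the α and β values are arbitrary assumptions:

import numpy as np

def content_loss(a_c, a_g):
    # mean-square error between the content and generated activations of one layer
    return 0.5 * np.mean((a_c - a_g) ** 2)

def gram_matrix(a):
    # channel-by-channel correlations of a (channels, height, width) activation volume
    features = a.reshape(a.shape[0], -1)
    return features @ features.T

def style_loss(a_s, a_g):
    # mean-square error between the Gram matrices of the style and generated activations
    return np.mean((gram_matrix(a_s) - gram_matrix(a_g)) ** 2)

a_c = np.random.rand(64, 32, 32)   # dummy activations of one hidden layer
a_s = np.random.rand(64, 32, 32)
a_g = np.random.rand(64, 32, 32)

alpha, beta = 1.0, 1e-4            # assumed weighting coefficients
total = alpha * content_loss(a_c, a_g) + beta * style_loss(a_s, a_g)
print(total)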
If you are interested in implementing neural style transfer, check out the official PyTorch
tutorial at https://pytorch.org/tutorials/advanced/neural_style_tutorial.html.
One shortcoming of this algorithm is that it's relatively slow. Typically, we have to repeat
this pseudo-training procedure for a couple hundred iterations to produce a visually-
appealing result. Fortunately, the paper Perceptual Losses for Real-Time Style Transfer and
Super-Resolution (https://arxiv.org/abs/1603.08155) by Justin Johnson, Alexandre Alahi,
and Li Fei-Fei, builds on top of the original algorithm to provide a solution, which is three
orders of magnitude faster.
Summary
In this chapter, we introduced some new and advanced computer vision techniques. We
started with transfer learning, which is a way to bootstrap network training by using pre-
trained models. Next, we discussed some of the popular neural network architectures in
use today. Then, we talked about capsule networks, which are a promising new approach
to computer vision. After that, we moved on to tasks beyond object classification, such as
object detection and semantic segmentation. And finally, we introduced neural style
transfer.
In the next chapter, we'll explore a new type of ML algorithms, called generative models.
We can use them to generate new content, such as images. Stay tuned, it will be fun!
6
Generating Images with GANs
and VAEs
"What I cannot create, I do not understand."- Richard Feynman
This quote is often cited in the same sentence as generative models, and for good reason. In
the previous two chapters (Chapter 4, Computer Vision with Convolutional Networks and
Chapter 5, Advanced Computer Vision), we focused on supervised computer vision
problems, such as classification and object detection. Now, we'll discuss how to create new
images with the help of unsupervised neural networks. After all, it's a lot better knowing
that you don't need labeled data. More specifically, we'll talk about generative models.
This chapter will cover the following topics:
Intuition and justification of generative models
Variational autoencoders
Generative Adversarial networks
Intuition and justification of generative models
So far, we've used neural networks as discriminative models. This simply means that given
input data, a discriminative model will map it to a certain label (in other words, a
classification). A typical example is the classification of MNIST images in 1 of 10 digit
classes, where the neural network maps the input data features (pixel intensities) to the
digit label. We can also say this in another way: a discriminative model gives us the
probability of $y$ (the class), given $x$ (the input), that is, $P(y|x)$. In the MNIST case, this is the
probability of the digit, given the pixel intensities of the image.
On the other hand, a generative model learns the distribution of the classes. You can think
of it as the opposite of what the discriminative model does. Instead of predicting the class
probability, $P(y|x)$, given certain input features, it tries to predict the probability of the input
features, given a class: $P(x|y)$. For example, a generative model will be able to create
an image of a handwritten digit, given the digit class. Since we only have 10 classes, it will
be able to generate just 10 images. But we used this example just to better illustrate the
concept. In reality, the "class" could be an arbitrary tensor of values, and the model would
be able to generate an unlimited number of images with different features. If you don't
understand this now, don't worry, we'll see many examples throughout the chapter.
Two of the most popular ways to use neural networks in a generative way are variational
autoencoders (VAEs) and Generative Adversarial Networks (GANs).
Variational autoencoders
To understand VAEs, let's talk about regular autoencoders first. An autoencoder is a feed-
forward neural network that tries to reproduce its input. In other words, the target value
(label) of an autoencoder is equal to the input data, $y_i = x_i$, where i is the sample index. We
can formally say that it tries to learn an identity function, $f(x) \approx x$ (a function that
repeats its input). Since our "labels" are just the input data, the autoencoder is an
unsupervised algorithm. The following diagram represents an autoencoder:
An autoencoder
An autoencoder consists of an input, hidden (or bottleneck), and output layers. Although
it's a single network, we can think of it as a virtual composition of two components:
Encoder: Maps the input data to the network's internal representation. For the
sake of simplicity, in this example the encoder is a single, fully-connected hidden
bottleneck layer. The internal state is just its activation vector. In general, the
encoder can have multiple hidden layers, including convolutional.
Decoder: Tries to reconstruct the input from the network's internal data
representation. The decoder can also have a complex structure, which typically
mirrors the encoder.
We can train the autoencoder by minimizing a loss function, which is known as the
reconstruction error. It measures the distance between the original input and
its reconstruction. We can minimize it in the usual way with gradient descent and
backpropagation. Depending on the approach, we can use either mean square error (MSE)
or binary cross-entropy (like cross-entropy, but with two classes) as reconstruction
errors. We first introduced MSE in Chapter 1, Machine Learning: an introduction and the
cross-entropy loss in Chapter 3, Deep Learning Fundamentals.
At this point, you might wonder what the point of the autoencoder is, since it just repeats
its input. However, we are not interested in the network output, but in its internal data
representation (which is also known as representation in the latent space). The latent space
contains hidden data features, which are not directly observed, but are inferred by the
algorithm instead. The key is that the bottleneck layer has fewer neurons than the
input/output ones. There are two main reasons for this:
Because the network tries to reconstruct its input from a smaller feature space, it
learns a compact representation of the data. You can think of it as a compression
(but not lossless).
By using fewer neurons, the network is forced to learn only the most important
features of the data. To illustrate this concept, let's look at denoising
autoencoders, where we intentionally use corrupted input data, but non-
corrupted target data during training. For example, if we train a denoising
autoencoder to reconstruct MNIST images, we can introduce noise by setting
random pixels of the image to maximum (white) intensity, as in the following
screenshot (a small sketch of this noise injection follows it).
To minimize the loss with the noiseless target, the autoencoder is forced to look
beyond the noise in the input and learn only the important features of the data.
However, if the network had more hidden neurons than input, it could overfit on
the noise. With the additional constraint of fewer hidden neurons, it has nowhere
to go but to try to ignore the noise. Once trained, we can use a denoising
autoencoder to remove the noise from real images:
Denoising autoencoder input and target
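The following is a small sketch of such a noise injection for a normalized MNIST-sized image; the noise ratio is an arbitrary choice:

import numpy as np

def add_white_noise(image, noise_ratio=0.1):
    # set a random fraction of the pixels to maximum (white) intensity
    noisy = image.copy()
    mask = np.random.rand(*image.shape) < noise_ratio
    noisy[mask] = 1.0                # assumes pixel intensities scaled to [0, 1]
    return noisy

clean = np.random.rand(28, 28)       # stand-in for a normalized MNIST image
noisy = add_white_noise(clean)       # network input; the clean image remains the target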
The encoder maps each input sample to the latent space and each attribute of the latent
representation has a discrete value. That means that an input sample can have only one
latent representation. Therefore, the decoder can reconstruct the input in only one possible
way. In other words, we can generate a single reconstruction of one input sample. But we
don't want this. Instead, we want to generate new images that are different from the
original. Enter VAEs.
A VAE can describe the latent representation in probabilistic terms. That is, instead of
discrete values, we'll have a probability distribution for each latent attribute, making the
latent space continuous. This makes it easier for random sampling and interpolation. Let's
illustrate this with an example. Imagine that we try to encode an image of a vehicle and our
latent representation has n attributes (n neurons in the bottleneck layer). Each attribute
represents one vehicle property, such as length, height, and width (see the following
diagram). Say that the average vehicle length is four meters. Instead of the fixed value, the
VAE can decode this property as a normal distribution with a mean of 4 (the same applies
for the others). Then, the decoder can choose to sample a latent variable from the range of
its distribution. For example, it can reconstruct a longer and lower vehicle, compared to the
input. In this way, the VAE can generate an unlimited number of modified versions of the
input:
An example of a variational encoder, sampling different values from the distribution ranges of the latent variables
Let's formalize this:
We'll denote the encoder with $q_{\varphi}(z|x)$, where $\varphi$ are the weights and biases of
the network, $x$ is the input, and $z$ is the latent space representation. The encoder
output is a distribution (for example, Gaussian) over the possible values of $z$,
which could have generated $x$.
We'll denote the decoder with $p_{\theta}(x|z)$, where $\theta$ are the decoder weights and
biases. First, $z$ is sampled stochastically (randomly) from the distribution. Then,
it's sent through the decoder, whose output is a distribution over the possible
corresponding values of $x$.
The VAE uses a special type of loss function with two terms:
The first is the Kullback-Leibler divergence between the probability distribution
$q_{\varphi}(z|x)$ and the expected probability distribution, $p(z)$. It measures how much
information is lost when we use $q_{\varphi}(z|x)$ to represent $p(z)$ (in other words, how
close the two distributions are). It encourages the autoencoder to explore different
reconstructions. The second is the reconstruction loss, which measures the
difference between the original input and its reconstruction. The more they differ,
the more it increases. Therefore, it encourages the autoencoder to better
reconstruct the data.
To implement this, the bottleneck layer won't directly output the latent state variables.
Instead, it will output two vectors, which describe the mean and variance of the
distribution of each latent variable:
Variational encoder sampling
Once we have the mean and variance distributions, we can sample a state, , from the
latent variable distributions and pass it through the decoder for reconstruction. But we
cannot celebrate yet, because this presents us with another problem: backpropagation
doesn't work over random processes such as the one we have here. Fortunately, we can
solve this with the so-called reparameterization trick. First, we'll sample a random
vector, ε, with the same dimensions as $z$, from a Gaussian distribution (the ε circle in the
preceding figure). Then, we'll shift it by the latent distribution's mean, μ, and scale it by the
latent distribution's standard deviation, σ:

$z = \mu + \sigma \odot \varepsilon$
In this way, we'll be able to only optimize the mean and variance (red arrows) and we'll
omit the random generator from the backward pass. At the same time, the sampled data
will have the properties of the original distribution.
Generating new MNIST digits with VAE
In this section, we'll see how a VAE can generate new digits for the MNIST dataset and
we'll use Keras to do so. We chose MNIST because it will illustrate the generative
capabilities of the VAE well. Let's start:
1. Do the imports:
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle
import numpy as np
from keras import backend as K
from keras.datasets import mnist
from keras.layers import Lambda, Input, Dense
from keras.losses import binary_crossentropy
from keras.models import Model
2. Instantiate the MNIST dataset (we've already done that):
(x_train, y_train), (x_test, y_test) = mnist.load_data()
image_size = x_train.shape[1] * x_train.shape[1]
x_train = np.reshape(x_train, [-1, image_size])
x_test = np.reshape(x_test, [-1, image_size])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
3. Implement the build_vae function, which will build the VAE:
We'll have separate access to the encoder, decoder, and the full
network. The function will return them as a tuple.
The bottleneck layer will have only 2 neurons (that is, we'll have only 2
latent variables). In this way, we'll be able to display the latent
distribution as a 2D plot.
The encoder/decoder will contain a single intermediate (hidden) fully-
connected layer with 512 neurons. This is not a convolutional network.
We'll use cross-entropy reconstruction loss and KL divergence.
The following is the implementation:
def build_vae(intermediate_dim=512, latent_dim=2):
"""
Build VAE
:param intermediate_dim: size of hidden layers of the
encoder/decoder
:param latent_dim: latent space size
:returns tuple: the encoder, the decoder, and the full vae
"""
# encoder first
inputs = Input(shape=(image_size,), name='encoder_input')
x = Dense(intermediate_dim, activation='relu')(inputs)
# latent mean and variance
z_mean = Dense(latent_dim, name='z_mean')(x)
z_log_var = Dense(latent_dim, name='z_log_var')(x)
# reparameterization trick for random sampling
# Note the use of the Lambda layer
# At runtime, it will call the sampling function
z = Lambda(sampling, output_shape=(latent_dim,),
name='z')([z_mean, z_log_var])
# full encoder encoder model
encoder = Model(inputs, [z_mean, z_log_var, z], name='encoder')
encoder.summary()
# decoder
latent_inputs = Input(shape=(latent_dim,), name='z_sampling')
x = Dense(intermediate_dim, activation='relu')(latent_inputs)
outputs = Dense(image_size, activation='sigmoid')(x)
# full decoder model
decoder = Model(latent_inputs, outputs, name='decoder')
decoder.summary()
# VAE model
outputs = decoder(encoder(inputs)[2])
vae = Model(inputs, outputs, name='vae')
# Loss function
# we start with the reconstruction loss
reconstruction_loss = binary_crossentropy(inputs, outputs) * image_size
# next is the KL divergence
kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
kl_loss = K.sum(kl_loss, axis=-1)
kl_loss *= -0.5
# we combine them in a total loss
vae_loss = K.mean(reconstruction_loss + kl_loss)
vae.add_loss(vae_loss)
return encoder, decoder, vae
4. Immediately tied to the network definition is the sampling function, which
implements the random sampling of the latent vector, z, using the
reparameterization trick (introduced in the Variational autoencoders section):
def sampling(args: tuple):
"""
Reparameterization trick: sample z by drawing epsilon from a unit Gaussian
:param args: (tensor, tensor) mean and log of variance of
q(z|x)
:returns tensor: sampled latent vector z
"""
# unpack the input tuple
z_mean, z_log_var = args
# mini-batch size
mb_size = K.shape(z_mean)[0]
# latent space size
dim = K.int_shape(z_mean)[1]
# random normal vector with mean=0 and std=1.0
epsilon = K.random_normal(shape=(mb_size, dim))
return z_mean + K.exp(0.5 * z_log_var) * epsilon
5. Implement the plot_latent_distribution function. It collects the latent
representations of all the images in the test set and displays them on a 2D plot.
We can do this because our network has only two latent variables (one for each
axis of the plot). Note that to implement it, we only need the encoder:
def plot_latent_distribution(encoder,
x_test,
y_test,
batch_size=128):
"""
Display a 2D plot of the digit classes in the latent space.
We are interested only in z, so we only need the encoder here.
:param encoder: the encoder network
:param x_test: test images
:param y_test: test labels
:param batch_size: size of the mini-batch
"""
z_mean, _, _ = encoder.predict(x_test, batch_size=batch_size)
plt.figure(figsize=(6, 6))
markers = ('o', 'x', '^', '<', '>', '*', 'h', 'H', 'D', 'd',
'P', 'X', '8', 's', 'p')
for i in np.unique(y_test):
plt.scatter(z_mean[y_test == i, 0], z_mean[y_test == i, 1],
marker=MarkerStyle(markers[i],
fillstyle='none'),
edgecolors='black')
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.show()
6. Implement the plot_generated_images function. It will sample n*n vectors, z,
in the [-4, 4] range for each of the two latent variables. Next, it will generate
images based on the sampled vectors and display them in a 2D grid. Note
that to do this, we only need the decoder:
def plot_generated_images(decoder):
"""
Display a 2D plot of the generated images.
We only need the decoder, because we'll manually sample the
distribution z
:param decoder: the decoder network
"""
# display a nxn 2D manifold of digits
n = 15
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
# linearly spaced coordinates corresponding to the 2D plot
# of digit classes in the latent space
grid_x = np.linspace(-4, 4, n)
grid_y = np.linspace(-4, 4, n)[::-1]
# start sampling z1 and z2 in the ranges grid_x and grid_y
for i, yi in enumerate(grid_y):
for j, xi in enumerate(grid_x):
z_sample = np.array([[xi, yi]])
x_decoded = decoder.predict(z_sample)
digit = x_decoded[0].reshape(digit_size, digit_size)
slice_i = slice(i * digit_size, (i + 1) * digit_size)
slice_j = slice(j * digit_size, (j + 1) * digit_size)
figure[slice_i, slice_j] = digit
# plot the results
plt.figure(figsize=(6, 5))
start_range = digit_size // 2
end_range = n * digit_size + start_range + 1
pixel_range = np.arange(start_range, end_range, digit_size)
sample_range_x = np.round(grid_x, 1)
sample_range_y = np.round(grid_y, 1)
plt.xticks(pixel_range, sample_range_x)
plt.yticks(pixel_range, sample_range_y)
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.imshow(figure, cmap='Greys_r')
plt.show()
7. Run the whole thing. We'll use the Adam optimizer (introduced in Chapter
3, Deep Learning Fundamentals) to train the network for 50 epochs:
if __name__ == '__main__':
encoder, decoder, vae = build_vae()
vae.compile(optimizer='adam')
vae.summary()
vae.fit(x_train,
epochs=50,
batch_size=128,
validation_data=(x_test, None))
plot_latent_distribution(encoder,
x_test,
y_test,
batch_size=128)
plot_generated_images(decoder)
If everything goes according to plan, once the training is over, we'll see the latent
distribution of each digit class over all the test images.
The left and bottom axes represent the z1 and z2 latent variables. Different marker shapes
represent different digit classes:
The latent distributions of the MNIST test images
Next, we'll see the images generated by plot_generated_images. The axes represent the
particular latent vector, z, used for each image:
Images generated by the VAE
Generative Adversarial Networks
In this section, we'll talk about arguably the most popular generative model today: the
GAN framework. It was first introduced in 2014 in the landmark paper Generative
Adversarial Nets (http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)
by Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. The GAN framework can work with
any type of data, but its most popular application by far is to generate images, and we'll
discuss it in this context only. Let's see how it works:
A GAN system
A GAN is a system of two components (neural networks):
Generator: This is the generative model itself. It takes a probability distribution
(random noise) as input and tries to generate a realistic output image. Its purpose
is similar to the decoder part of the VAE.
Discriminator: This takes two alternating inputs: the real images of the training
dataset or the generated fake samples from the generator. It tries to determine
whether the input image comes from the real images or the generated ones.
The two networks are trained together as a system. On the one hand, the discriminator tries
to get better at distinguishing between the real and fake images. On the other hand, the
generator tries to output more realistic images, so it could "deceive" the discriminator into
thinking that the generated image is real. To use the analogy in the original paper, you can
think of the generator as a team of counterfeiters, trying to produce fake currency.
Conversely, the discriminator acts as a police officer, trying to capture the fake money, and
the two are constantly trying to deceive each other (hence the name adversarial). The
ultimate goal of the system is to make the generator so good that the discriminator
wouldn't be able to distinguish between the real and fake images. Even though the
discriminator does classification, a GAN is still unsupervised, since we don't need labels for
the images.
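To make the two roles concrete, here is a minimal sketch of a generator and a discriminator as fully-connected Keras networks. The layer sizes and the flattened 28 x 28 image size are our own assumptions, chosen only for illustration; real implementations are usually larger and often convolutional:

from keras.layers import Dense, Input
from keras.models import Model

latent_dim = 100      # size of the random input vector z
image_size = 28 * 28  # flattened image size (an assumption)

# generator: latent vector z -> fake image with pixel values in [0, 1]
z = Input(shape=(latent_dim,))
g = Dense(256, activation='relu')(z)
fake_image = Dense(image_size, activation='sigmoid')(g)
generator = Model(z, fake_image, name='generator')

# discriminator: image -> probability that the image is real
image = Input(shape=(image_size,))
d = Dense(256, activation='relu')(image)
validity = Dense(1, activation='sigmoid')(d)
discriminator = Model(image, validity, name='discriminator')

We'll re-use these two hypothetical models in the training sketches later in this section.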
Training GANs
Our main goal is for the generator to produce realistic images and the GAN framework is a
vehicle for that goal. We'll train the generator and the discriminator separately and
sequentially (one after the other), and alternate between the two phases multiple times.
Before going into more detail, let's introduce some notation:

We'll denote the generator with $G(z, \theta_g)$, where $\theta_g$ are the network weights
and $z$ is the latent vector, which serves as an input to the generator. Think of it as
a random seed value that kickstarts the image-generation process. It is similar to the
latent vector in the VAEs. $z$ has a probability distribution, $p_z(z)$, which is
usually random normal or random uniform. The generator outputs fake
samples, $x = G(z)$, with a probability distribution of $p_g(x)$. You can think of $p_g(x)$ as
the probability distribution of the real data according to the generator.

We'll denote the discriminator with $D(x, \theta_d)$, where $\theta_d$ are the network weights.
It takes as input either the real data, with the distribution $p_{data}(x)$, or the
generated samples, $G(z)$. The discriminator is a binary classifier, which
outputs whether the input image is part of the real data (network output 1) or the
generated data (network output 0).

During training, we'll denote the discriminator and generator loss functions with
$J^{(D)}$ and $J^{(G)}$, respectively.
Here is a more detailed diagram of a GAN framework:
Detailed example of a Generative Adversarial network
GAN training is different compared to the training of a regular DNN, because we have two
networks. We can think of it as a sequential minimax zero-sum game of two players
(generator and discriminator):
Sequential: Means that the players take turns, one after another, similar to chess
or tic-tac-toe (as opposed to simultaneous games). First, the discriminator tries to
minimize $J^{(D)}$, but it can only do so by adjusting the weights, $\theta_d$. Next, the
generator tries to minimize $J^{(G)}$, but it can only adjust the weights, $\theta_g$. We
repeat this process multiple times.

Zero-sum: Means that the gains or losses of one player are exactly balanced by
the gains or losses of the opposite player. That is, the sum of the generator's loss
and the discriminator's loss is always 0:

$J^{(G)} = -J^{(D)}$

Minimax: Means that the strategy of the first player (the generator) is to minimize
the opponent's (the discriminator's) maximum score (hence the name). When we train
the discriminator, it becomes better at distinguishing between real and fake
samples (minimizing $J^{(D)}$). Next, when we train the generator, it tries to step up
to the level of the newly improved discriminator (we minimize $J^{(G)}$, which is
equivalent to maximizing $J^{(D)}$). The two networks are in constant
competition. We'll denote the minimax game with the following, where $V$ is the
cost function:

$\min_G \max_D V(D, G)$

Let's assume that, after a number of training steps, both $J^{(D)}$ and $J^{(G)}$ will be at
some local minimum. The solution to the minimax game is then called a Nash
equilibrium. A Nash equilibrium occurs when neither player can improve its
outcome by changing its own action, given the other player's action. In a
GAN framework, this happens when the generator becomes so good that the
discriminator is no longer able to distinguish between the generated and real
samples. That is, the discriminator output will always be 0.5, regardless of the
presented input.
Training the discriminator
The discriminator is a classification neural network and we can train it in the usual way,
using gradient descent and backpropagation. However, the training set is composed of
equal parts real and generated samples. Let's see how to incorporate that in the training
process:
1. Depending on the input sample (real or fake), we have two paths:

Select a sample from the real data, $x \sim p_{data}(x)$, and use it to produce
$D(x)$.

Generate a fake sample, $G(z)$. Here, the generator and discriminator
work as a single network. We start with a random vector, $z$, which we
use to produce the generated sample, $G(z)$. Then, we use it as input to
the discriminator to produce the final output, $D(G(z))$.
2. Compute the loss function, which reflects the duality of the training data (more
on that later).
3. Backpropagate the error gradient and update the weights. Although the two
networks work together, the generator weights, $\theta_g$, will be locked and we'll only
update the discriminator weights, $\theta_d$. This ensures that we'll improve the
discriminator's performance by making it better, as opposed to making the
generator worse.
To understand the discriminator loss, let's recall the formula for the cross-entropy loss:

$H(p, q) = -\sum_{i=1}^{n} p_i \log q_i$

where $q_i$ is the estimated probability of the output belonging to the $i$-th class (out of $n$ total
classes) and $p_i$ is the actual probability. For the sake of simplicity, we'll assume that we
apply the formula over a single training sample. In the case of binary classification, this
formula can be simplified as follows:

$H(p, q) = -\left(p \log q + (1 - p) \log(1 - q)\right)$

When the target probabilities are one-hot encoded, that is, $(p, 1-p)$ is either
$(1, 0)$ or $(0, 1)$, one of the two loss terms is always 0.
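As a quick numerical illustration of the binary formula (the probability values are our own toy numbers):

import numpy as np

def binary_cross_entropy(p, q):
    """p: target probability, q: estimated probability"""
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

# a real sample (target p = 1) that the classifier scores as q = 0.9
print(binary_cross_entropy(1.0, 0.9))  # ~0.105 -> low loss
# the same real sample scored as q = 0.1
print(binary_cross_entropy(1.0, 0.1))  # ~2.303 -> high loss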
We can expand the formula for a mini-batch of $m$ samples:

$H(p, q) = -\frac{1}{m}\sum_{j=1}^{m}\left(p_j \log q_j + (1 - p_j)\log(1 - q_j)\right)$

Knowing all this, let's define the discriminator loss:

$J^{(D)} = -\frac{1}{2}\mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] - \frac{1}{2}\mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$

Although it seems complex, it is just the cross-entropy loss for a binary classifier with some
GAN-specific bells and whistles. Let's discuss them:

The two components of the loss reflect the two possible classes (real or fake),
which are present in equal numbers in the training set.

$-\frac{1}{2}\mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right]$ is the loss when the input is sampled from the real data.
Ideally, in such cases, we'll have $D(x) = 1$.

In this context, the term $\mathbb{E}_{x \sim p_{data}}$ (called the expectation) implies that $x$ is
sampled from $p_{data}$. In essence, this part of the loss means "when we
sample $x$ from $p_{data}$, we expect the discriminator output $D(x) = 1$". Finally, 0.5 is
the cumulative class probability of the real data, $p_{data}$, since it comprises exactly
half of the whole set.

$-\frac{1}{2}\mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$ is the loss when the input is sampled from the generated
data. Here, we can make the same observations as with the real data component.
However, this term is minimized when $D(G(z)) = 0$.

To summarize, the discriminator loss will be zero when $D(x) = 1$ for all real samples
$x \sim p_{data}$ and $D(G(z)) = 0$ for all generated samples (that is, for all $z \sim p_z$).
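Before we move on to the generator, the following sketch shows what one discriminator update could look like in Keras, re-using the hypothetical generator and discriminator from the earlier sketch. With targets of 1 for real samples and 0 for fake ones, the built-in binary cross-entropy reduces to the loss we just described (up to the constant 0.5 factor):

import numpy as np

# assumes generator, discriminator, and latent_dim from the earlier
# sketch, plus an array of real images, x_train, scaled to the
# generator's output range (for example, flattened MNIST images in [0, 1])
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

batch_size = 64

# a mini-batch of real samples, labeled 1
idx = np.random.randint(0, x_train.shape[0], batch_size)
real_images, real_labels = x_train[idx], np.ones((batch_size, 1))

# a mini-batch of generated (fake) samples, labeled 0
z = np.random.normal(size=(batch_size, latent_dim))
fake_images, fake_labels = generator.predict(z), np.zeros((batch_size, 1))

# only the discriminator weights are updated here;
# the generator is used for inference only
d_loss_real = discriminator.train_on_batch(real_images, real_labels)
d_loss_fake = discriminator.train_on_batch(fake_images, fake_labels)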
Training the generator
We'll train the generator by making it better at deceiving the discriminator. To do this, we'll
need both networks, similar to the way we train the discriminator with fake samples:
1. We start with a random latent vector, $z$, and feed it through both the generator
and the discriminator to produce the output, $D(G(z))$.
2. The loss function is the same as the discriminator loss. However, our goal here is
to maximize it, rather than minimize it, since we want to deceive the
discriminator.
3. In the backward pass, the discriminator weights, $\theta_d$, are locked and we can only
adjust $\theta_g$. This forces us to maximize the discriminator loss by making the
generator better, instead of making the discriminator worse.
You may notice that in this phase, we only use generated data. The part of the loss function
that deals with real data will always be 0. Therefore, we can simplify it to the following:

$J^{(G)} = \frac{1}{2}\mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$

The derivative (gradient) of this formula with respect to $D(G(z))$ is displayed in the following
figure with an uninterrupted line. It imposes a limitation on the training. Early on, when the
discriminator can easily distinguish between real and fake samples ($D(G(z)) \approx 0$), the
gradient will be close to zero. This would result in little learning of the weights, $\theta_g$ (this
problem is known as diminished gradient):
Gradients of the two generator loss functions
We can solve this issue by using a different loss function:

$J^{(G)} = -\frac{1}{2}\mathbb{E}_{z \sim p_z}\left[\log D(G(z))\right]$

The derivative of this function is displayed in the preceding figure with a dashed line. This
loss is still minimized when $D(G(z)) = 1$, but now the gradient is large when
the generator underperforms. With this loss, the game is no longer zero-sum, but this won't
have a practical effect on the GAN framework.
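In Keras, this improved loss is what we effectively get if we stack the generator and a frozen discriminator into a single model and train it with targets of 1 ("real"), because the binary cross-entropy with a target of 1 reduces to $-\log D(G(z))$. Here is a minimal sketch, continuing with the hypothetical models from before:

import numpy as np
from keras.layers import Input
from keras.models import Model

# freeze the discriminator weights for the generator phase
# (it was already compiled while trainable, so it can still be
# trained directly in its own phase)
discriminator.trainable = False

# combined model: z -> generator -> discriminator -> validity
z_in = Input(shape=(latent_dim,))
combined = Model(z_in, discriminator(generator(z_in)))
combined.compile(optimizer='adam', loss='binary_crossentropy')

# train the generator so that the (frozen) discriminator outputs 1
batch_size = 64
z = np.random.normal(size=(batch_size, latent_dim))
misleading_labels = np.ones((batch_size, 1))
g_loss = combined.train_on_batch(z, misleading_labels)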
Putting it all together
With our newfound knowledge, we can define the minimax objective in full:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$
In short, the generator tries to minimize the objective, while the discriminator tries to
maximize it. Note that while the discriminator should minimize its loss, the minimax
objective is a negative of the discriminator loss, and therefore the discriminator has to
maximize it.
The following is the step-by-step training algorithm, as introduced by the authors of the
GAN framework.
Repeat for a number of iterations:
1. Repeat for k steps, where k is a hyperparameter:
Sample a mini-batch of $m$ random samples, $\{z^{(1)}, \ldots, z^{(m)}\}$, from the
latent space distribution $p_z(z)$.
Sample a mini-batch of $m$ samples, $\{x^{(1)}, \ldots, x^{(m)}\}$, from the real data
distribution $p_{data}(x)$.
Update the discriminator weights, $\theta_d$, by ascending the stochastic
gradient of its loss:

$\nabla_{\theta_d} \frac{1}{m}\sum_{j=1}^{m}\left[\log D\left(x^{(j)}\right) + \log\left(1 - D\left(G\left(z^{(j)}\right)\right)\right)\right]$
2. Sample a mini-batch of $m$ random samples, $\{z^{(1)}, \ldots, z^{(m)}\}$, from the latent space
distribution $p_z(z)$.
3. Update the generator weights, $\theta_g$, by descending the stochastic gradient of its loss:

$\nabla_{\theta_g} \frac{1}{m}\sum_{j=1}^{m}\log\left(1 - D\left(G\left(z^{(j)}\right)\right)\right)$
To conclude this section, we should mention that the gradient descent algorithm is designed to
find the minimum of a loss function, rather than a Nash equilibrium, which is not the
same thing. As a result, the training may sometimes fail to converge. However, due to the
popularity of GANs, many improvements have been proposed; if you're interested in
training GANs, we encourage you to research them further.
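Putting the two phases together, a bare-bones version of the alternating loop (with k = 1, a common practical choice) might look like the following, again re-using the hypothetical models and the combined model from the previous sketches:

import numpy as np

steps = 10000
batch_size = 64
k = 1  # discriminator updates per generator update

for step in range(steps):
    # 1. train the discriminator for k steps
    for _ in range(k):
        idx = np.random.randint(0, x_train.shape[0], batch_size)
        real_images = x_train[idx]
        z = np.random.normal(size=(batch_size, latent_dim))
        fake_images = generator.predict(z)

        discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
        discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))

    # 2. and 3. train the generator through the combined model
    z = np.random.normal(size=(batch_size, latent_dim))
    combined.train_on_batch(z, np.ones((batch_size, 1)))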
Types of GANs
Since the GAN framework was first introduced, a lot of new variations have emerged.
In fact, there are now so many new GANs that, in order to stand out, some of the authors
have come up with creative GAN names, such as BicycleGAN, DiscoGAN, GANs for LIFE,
and ELEGANT. In this section, we'll discuss some of them.
DCGAN
In the original GAN framework proposal, the authors used only fully-connected networks.
The first major improvement of the GAN framework is the Deep Convolutional Generative
Adversarial Network (DCGAN). In this new architecture, both the generator and the
discriminator are convolutional networks. They have some constraints, which help to
stabilize the training:
The discriminator uses strided convolutions instead of pooling layers.
The generator is a special type of CNN, which uses fractional-strided
convolutions to increase the size of the images. We'll discuss it in the next section.
Both networks use batch normalization.
No fully-connected layers, with the exception of the last layer of the
discriminator.
LeakyReLU (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Leaky_ReLUs)
activations for all layers of the generator, except the
output, which uses tanh (introduced in Chapter 2, Neural Networks).
LeakyReLU activations for all layers of the discriminator, except the output,
which uses sigmoid.
You can think of these as general guidelines for GAN training and not just for DCGAN.
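As an illustration of these guidelines, here is a minimal sketch of a DCGAN-style generator in Keras with batch normalization, LeakyReLU activations, a tanh output, and fractionally-strided (transposed) convolutions, which we'll look at in the next section. The layer sizes target 28 x 28 single-channel images and are our own choice, not a prescribed architecture:

from keras.layers import (Input, Dense, Reshape, Conv2DTranspose,
                          BatchNormalization, LeakyReLU)
from keras.models import Model

latent_dim = 100

z = Input(shape=(latent_dim,))
# project and reshape the latent vector into a small feature map
x = Dense(7 * 7 * 128)(z)
x = Reshape((7, 7, 128))(x)
x = BatchNormalization()(x)
x = LeakyReLU(0.2)(x)
# 7x7 -> 14x14 with a fractionally-strided (transposed) convolution
x = Conv2DTranspose(64, kernel_size=4, strides=2, padding='same')(x)
x = BatchNormalization()(x)
x = LeakyReLU(0.2)(x)
# 14x14 -> 28x28, single output channel in the [-1, 1] range
image = Conv2DTranspose(1, kernel_size=4, strides=2, padding='same',
                        activation='tanh')(x)

dcgan_generator = Model(z, image, name='dcgan_generator')

Each transposed convolution with strides=2 doubles the spatial size of its input, growing the 7 x 7 feature map into a 28 x 28 image.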
The generator in DCGAN
In the following diagram, we can see a sample generator network in the DCGAN
framework:
Generator network with deconvolutional layers
As usual, the generator starts with a random latent vector, z. To transform it into an image,
we'll use a network with a special type of convolution operation, called transposed
convolution (also known as deconvolution or fractionally-strided convolution). We briefly
touched on it in Chapter 4, Computer Vision with Convolutional Networks, in the
Backpropagation in convolutional layers section, but let's discuss it in a little more detail
now. You can think of the transposed convolution as the opposite of the regular convolution.
As usual, we have an input, an output, and a filter with weights. But here, we'll apply the filter
over a single input neuron to produce multiple outputs.