
perform the same type of task. Grouping models by task would be convenient for
comparing alternative approaches to the same problem.

Historically, however, the second type of organization preceded the first type.
These models were developed by certain researchers, and people referred to these
models by the developers' names; thus the names were and still are popular. It is
probably easier to remember and identify these models by their names rather than by
their functional characteristics. In this book, we will basically use the second method
of organization, since the objective here is to introduce the basics rather than to
provide an extensive coverage of the field. With this introductory background, we
will be able to understand extensions of the basics whenever necessary.

3.2 Associative Memory

In Webster's New World Dictionary, "association" is described as "a connection in
the mind between ideas, sensations, memories, etc." The human brain can associate
different types of inputs; for example, we can associate a visual appearance and voice
with a specific person. We can also associate modified values of the same type of
inputs. We can recognize a picture of a person, for instance, or the face of a friend
we have not seen for ten years.

Ordinary computer memory is non-associative. That is, an exact memory address
must be specified, and only the information at this address is retrieved. Certain types
of neural networks can be used as associative memory (or content-addressable
memory). Associative memory is a type of neural network that can map (associate)
inputs, which are contents rather than addresses, to information stored in memory.
That is, given an input (which may be partial, noisy, or may contain an error), the
network can retrieve a complete memory which is the closest match to the input.
Mathematically, this property can be stated as mapping an input vector x to the
closest vector x(s) among vectors x(1), x(2), ..., x(m). To implement an associative
memory, we can use a recurrent neural network and select appropriate weights in
such a way that desired stable outputs will come out for given inputs.

For example, suppose "H.A. Kramers & G.H. Wannier, Phys. Rev. 60, 252
(1941)" is stored in the computer memory. By giving sufficient partial information,
such as "& Wannier, (1941)", our associative memory would be capable of retrieving
the entire memory. An ideal memory could deal with errors and retrieve this memory
even from input "Vannier, (1941)" [Hopfield, 1982]. The following is another simple
example of associative memory.

Example. Pattern association.

Fig. 3.1(a) shows six exemplar patterns of the characters "1," "A," "X," "Y," "Z," and
"C." Each pattern is drawn on a 10 × 12 pixel board, which can be converted to a
one-dimensional array or a vector of 10 × 12 = 120 components. Note that the
two-dimensional view is only for human perception; the computer does not
understand these patterns as two-dimensional. The value of each vector element is -1
if the corresponding pixel on the board is white (blank), 1 if black. Fig. 3.1(b)
shows how a noisy input pattern converges to the closest exemplar pattern during
iterations. These figures were drawn by my graduate student Bill Leach. The m = 6
exemplar patterns can be denoted as vectors x(1), x(2), ..., and x(6), and the noisy input
vector as x. This type of pattern association can be implemented by using associative
memory, such as the Hopfield network discussed in the next section.

(a)

(b)

Fig. 3.1. Demonstration of associative memory by pattern association. (a) Six exemplar
patterns. (b) A noisy input pattern approaches its closest exemplar pattern.

Many types of associative memories have been proposed. In this chapter, we
illustrate the basic idea by the Hopfield network.

3.3 Hopfield Networks

The Hopfield network model is probably the second most popular type of neural
network after the backpropagation model. There are several versions of Hopfield
networks. They can be used as associative memory, as we will discuss in this section,
and they can also be applied to optimization problems, as we will study in the next
section. The version for the associative memory is classified as supervised learning
by some authors and as unsupervised by others, with the distinction based on the
authors’ interpretation of the definitions. Given that the network performs pattern
association under the supervision of a teacher, we use the former definition in this
book.

The basic idea of the Hopfield network is that it can store a set of exemplar
patterns as multiple stable states. Given a new input pattern, which may be partial or
noisy, the network can converge to one of the exemplar patterns that is nearest to the
input pattern. This is the basic concept of applying the Hopfield network as
associative memory.


Architecture
As shown in Fig. 3.2, a Hopfield network consists of a single layer of neurons, 1, 2, ...,
n. The network is fully interconnected; that is, every neuron in the network is
connected to every other neuron. The network is recurrent; that is, it has
feedforward/feedbackward capabilities, which means input to the neurons comes
from external input as well as from the neurons themselves internally.

Fig. 3.2. A Hopfield network configuration.

Each input/output, xi or yj, takes a discrete bipolar value of either 1 or -1. The number
of neurons, n, is the size required for each pattern in the bipolar representation. For
example, suppose that each pattern is a letter represented by a 10 × 12
two-dimensional array, where each array element is either 1 for a black square or -1
for a blank square (for example, Fig. 3.1). Then n will be 10 × 12 = 120.

Each edge is associated with a weight, wij, which satisfies the following conditions:
wij = wji for all i, j = 1, n

and
wii = 0 for all i = 1, n.

(Because wii = 0, the self-coupling edge of Neuron i, i.e., the edge from Neuron i to
Neuron i, can be considered as "not connected.") The values of wij are specified in
advance from exemplar patterns and fixed as we will see in the following example.
Computational procedures
Determining wij
Suppose that m exemplar patterns are presented (s = 1, m). Each pattern x(s) has n
inputs, x1(s), x2(s), ..., xn(s) , where xk(s) = 1 or -1. Determine wij for i, j = 1 to n by:

w_{ij} = 0                                          for i = j

w_{ij} = \sum_{s=1}^{m} x_i^{(s)} x_j^{(s)}         for i ≠ j

After this determination of wij, the values of wij's are not changed. This feature is
different from the backpropagation model, where wij's are changed as the learning
process proceeds. The Hopfield network is classified under supervised learning since
at the beginning it is given correct exemplar patterns by a teacher.

Interpretation of wij is as follows. For a given pattern s, if xi(s) = 1 and xj(s) = 1 (i.e.,
both neurons i and j are active), or if xi(s) = -1 and xj(s) = -1 (i.e., both neurons are
inactive), then xi(s)xj(s) = 1 and a positive contribution to wij results. If this occurs for
the majority of m patterns, then wij > 0, i.e., the synapse between neurons i and j
becomes excitatory. The higher the number of such patterns, the more excitatory the
synapse.

On the other hand, if xi(s) = 1 and xj(s) = -1, or if xi(s) = -1 and xj(s) = 1 (i.e., if
xi(s)xj(s) = -1), then a negative contribution to wij results. If this occurs for the majority of
patterns, then wij < 0, i.e., the synapse becomes inhibitory.
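
To make the rule concrete, the following short Python sketch (my illustration, not code from the original text) computes the weight matrix from a list of bipolar exemplar patterns; the function and variable names are chosen only for this example.

def hopfield_weights(patterns):
    """Compute w[i][j] = sum over patterns of x_i * x_j, with w[i][i] = 0.

    patterns: list of exemplar vectors, each a list of +1/-1 values.
    """
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j]
    return w

# Example: two small exemplar patterns with n = 4 components each.
exemplars = [[1, -1, 1, -1], [1, 1, -1, -1]]
w = hopfield_weights(exemplars)   # symmetric, zero diagonal, fixed afterwards

The resulting matrix is symmetric with a zero diagonal, and it is never changed once the exemplar patterns have been stored.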

New input xi(0), and xi

Every input xi(0) and xi is bipolar (i.e., either -1 or +1). The initial values of xi(0), i
= 1, n are given at time t = 0. Let neti(t) = Σnj=1 wijxj(t). Then xi(t+1), for i = 1 to n is
determined as follows:

xi(t+1) = 1 if neti(t) > θi
xi(t) if neti(t) = θi
-1 if neti(t) < θi

Here θi are the thresholds. θi are usually set to 0.
The neurons are updated one at a time; for a specific value of i, the new value xi(t+1)
computed by the above rule replaces xi(t). The only constraint on the updates is that
all the neurons must be updated at the same average rate. Often the neurons are picked
out at a uniformly random rate. Starting
with a given initial input xi(0), xi(t) can converge to the closest stored pattern.
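
A minimal Python sketch of this asynchronous update (my illustration; the thresholds θi are set to 0, as is usual) picks one neuron at random and applies the rule above:

import random

def update_one_neuron(x, w, theta=0.0):
    """Apply the update rule to one randomly picked neuron i and return i."""
    n = len(x)
    i = random.randrange(n)                  # pick a neuron uniformly at random
    net_i = sum(w[i][j] * x[j] for j in range(n))
    if net_i > theta:
        x[i] = 1
    elif net_i < theta:
        x[i] = -1
    # if net_i == theta, x[i] is left unchanged
    return i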
An intuitive interpretation of this process follows. Assume that we have a small
number of uncorrelated exemplar patterns in comparison with the number of neurons.
neti in the above formula at iteration step t can be expressed by using the definitions
of neti and wij:

net_i = \sum_{j=1}^{n} w_{ij} x_j = \sum_{j=1,\, j \neq i}^{n} \left( \sum_{s=1}^{m} x_i^{(s)} x_j^{(s)} \right) x_j = \sum_{s=1}^{m} x_i^{(s)} \left( \sum_{j=1,\, j \neq i}^{n} x_j^{(s)} x_j \right)


At the last step, we swapped the two summations (one for s and the other for j) since
they are independent, and factored xi(s), which does not depend on j, outside of
Σnj=1, j≠i.

Now imagine that the unknown input pattern x closely resembles a specific
exemplar pattern x(s') and is totally uncorrelated with the remaining exemplar patterns.
The last factor, Σnj=1, j≠i xj(s) xj, will have a large value for the close exemplar pattern
s = s', since xj(s')xj will be 1 for most j (this is because when xj is 1, xj(s') is likely to
be 1, and when xj is -1, xj(s') is likely to be -1). If xj and xj(s') completely match for all
j except j = i, the summation will be n - 1. The first factor, xi(s'), is multiplied by this
large second factor and will contribute significantly toward neti. On the other hand, the
second factor, Σnj=1, j≠i xj(s) xj, will be much smaller for the remaining uncorrelated
exemplar patterns. This is because sometimes xj and xj(s) have the same value
(either 1 or -1), resulting in xj(s)xj = 1; at other times they have opposite values,
resulting in xj(s)xj = -1; thus, the 1's and -1's cancel out, yielding a small sum.
This implies that the contributions from the remaining patterns toward neti are small.
The overall effect is that x(t+1) of the next iteration step will be closer to x(s'), or
xi(t+1) will be closer to xi(s') in terms of each component.

The actual outcome of the converged solution may not necessarily be the closest
matched exemplar pattern. It can be some other exemplar pattern or a pattern
different from any of the exemplar patterns. To reduce the probability of such an
error, the number of exemplar patterns, m, should be less than 0.15n [Hopfield, 1982].
In practice, m is typically kept well below 0.15n. Several factors affect the outcome
of a converged solution. They include the ratio m/n, correlation among the exemplar
patterns, the initial values of xi(0)'s, and the updating processes (the scheme of
picking out the neurons and the random number seeds).

Energy function

Define E, an energy (or Lyapunov) function, as:

E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} x_i(t) x_j(t) + \sum_{i=1}^{n} \theta_i x_i(t)

We can prove that as iterations proceed, E always decreases when xi(t) changes the
value, and E stays the same when there is no change in xi(t)'s.
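
The energy is straightforward to evaluate from this definition; a small Python sketch (assuming the same bipolar state vector and weight matrix as in the earlier sketches, with thresholds of zero by default) is:

def hopfield_energy(x, w, theta=None):
    """E = -1/2 * sum_ij w_ij x_i x_j + sum_i theta_i x_i."""
    n = len(x)
    if theta is None:
        theta = [0.0] * n
    quad = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -0.5 * quad + sum(theta[i] * x[i] for i in range(n))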

When we think of the term "energy," we think of a physical quantity such as
kinetic or electric energy. This was probably the idea when the equation was
originally defined, but we interpret our energy in a much broader sense in our
applications. The term "energy" here represents a measure that reflects the state of the
solution, and is somewhat analogous to the concept of physical energy.

Basic processing steps

Step 1. Storing exemplar patterns as preprocessing.

Determine wij for i, j = 1, n, for exemplar patterns using the
"Determining wij" procedure described above.


Step 2. Finding the closest matching exemplar pattern to given input
representing an unknown pattern.

At t = 0, apply given input xi(0), i = 1, n.

Perform iterations updating xi(t)'s until the energy function E stops
decreasing (or, equivalently, xi(t)'s remain unchanged). Then xi(t)
represents the solution, that is, the exemplar pattern that best matches
(associates to) the unknown input. The converged xi(t) can be sent out
externally as output yi.

Implementation considerations of Step 2 above
There are two conditions:

1. Selection of neurons. Each neuron is picked out uniformly at random, independent
of other neurons, and at the same average rate. For each neuron xi, the new value
updates the value of xi, and will be used in the computation of other neurons (and
xi itself, if it is picked up again).

2. Convergence. The energy decreases if and only if a neuron changes its state.
A solution is converged upon if all the neurons are updated without any change.

The following are possible methods for examining the above conditions.

1. Select i = 1, 2, ..., n, in this order (this is an epoch). Test for convergence. If not
converged, then go to the next epoch. This method violates Condition 1 (it is not
random).

2. Select i from 1 to n, at a uniformly random rate, independent of other neurons, n
times (this is an epoch). Test for convergence. If not converged, then go to the next
epoch. This method violates Condition 2 (some neurons may not be updated).

3. Select a unique i every time from 1 to n, in random order, n times; that is, every
number between 1 and n is picked up once and only once (this is an epoch). Test for
convergence. If not converged, then go to the next epoch. This method can be
implemented by randomly permuting numbers 1 to n, then picking out one number
at a time in order. This method violates Condition 1 (neurons are not picked out
independently of other neurons, because the probabilities of neurons being picked
out get higher when they have not been picked out).

4. Select i from 1 to n, at a uniformly random rate, independent of other neurons, until
every neuron is updated at least once (this is an epoch). Test for convergence. If
not converged, then go to the next epoch.

A possible implementation of Method 4 is: define an "update vector," q = (q1, . . ., qn),
where qi = 1 if xi has been updated at least once since the beginning of the epoch,
and qi = 0 otherwise. For each epoch,
Initialize: q = 0, i.e., qi = 0 for i = 1, n, and count = n, where count is the number of
neurons not yet updated.

Repeat: Every time xi is updated, check qi. If qi = 0, then set qi = 1 and count = count – 1; if count =
0, then the end of the epoch is reached.
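
In Python, this bookkeeping might be rendered as follows (a sketch under the assumptions above; the update rule and thresholds are the same as before):

import random

def run_one_epoch_method4(x, w):
    """Method 4: update randomly chosen neurons until every neuron
    has been updated at least once (one epoch)."""
    n = len(x)
    q = [0] * n          # update vector: q[i] = 1 once neuron i has been updated
    count = n            # number of neurons not yet updated in this epoch
    while count > 0:
        i = random.randrange(n)              # uniformly random, independent picks
        net_i = sum(w[i][j] * x[j] for j in range(n))
        x[i] = 1 if net_i > 0 else (-1 if net_i < 0 else x[i])
        if q[i] == 0:
            q[i] = 1
            count -= 1
    return x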

This last method has no violation of either Condition 1 or Condition 2. Methods 1, 2
and 3 involve Condition 1 or 2 violation, but they seem to be used in practice. My
students experimented with Methods 1 through 4. For midsize problems (say, n = 120),
Method 4 took about 5 to 7 times the number of iterations of the other three for an
epoch. The number of epochs required for the network to converge was about the same
for all the methods.

Example.

The pattern association example (Fig. 3.1) can be implemented by the Hopfield
network architecture and algorithm discussed in this section. The network has 120
neurons. In Step 1, the network stores m = 6 exemplar patterns given in Fig. 3.1(a),
by assigning appropriate values of wij. At t = 0 in Step 2, the network is given the
unknown pattern (the first pattern in Fig. 3.1(b), xi, i = 1, 120). During the subsequent
iterations, xi(t) will gradually converge to the exemplar pattern "A" that matches the
unknown input.
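
Putting the pieces together, here is a compact end-to-end Python sketch of Steps 1 and 2 (a toy illustration only, not the program that produced Fig. 3.1; the 120-component pixel vectors would simply replace the short vectors used here). Convergence is detected by running epochs until no neuron changes.

import random

def recall(patterns, x0, max_epochs=100):
    """Store exemplar patterns (Step 1) and recall the closest one from x0 (Step 2)."""
    n = len(x0)
    # Step 1: fix the weights from the exemplar patterns.
    w = [[0 if i == j else sum(p[i] * p[j] for p in patterns)
          for j in range(n)] for i in range(n)]
    x = list(x0)
    for _ in range(max_epochs):
        changed = False
        for i in random.sample(range(n), n):      # one epoch (Method 3 style)
            net_i = sum(w[i][j] * x[j] for j in range(n))
            new_xi = 1 if net_i > 0 else (-1 if net_i < 0 else x[i])
            if new_xi != x[i]:
                x[i], changed = new_xi, True
        if not changed:                           # no change in an epoch: converged
            break
    return x

# Toy usage: two 6-component exemplars and a noisy version of the first one.
exemplars = [[1, 1, 1, -1, -1, -1], [1, -1, 1, -1, 1, -1]]
noisy = [1, 1, -1, -1, -1, -1]
print(recall(exemplars, noisy))   # expected to recover the first exemplar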

3.4 The Hopfield-Tank Model for Optimization Problems:
The Basics

We will now discuss a neural network model which is particularly popular in
optimization problems. In their 1985 article, Hopfield and Tank reported that the
traveling salesman problem (TSP) can be solved much faster by using their model
than by using existing methods. Although the model does not necessarily determine
the optimal solution, it can find solutions which are close enough for practical
purposes. This was a blockbuster for the computing community, since the lack of
efficient methods for NP-complete problems had long been a major obstacle. Soon after the
publication of the article, several papers came out which contested Hopfield and
Tank's claims. Since then, however, more and more application problems have been
solved using the model, and it has become a major optimization technique. Since
optimization problems are very common and important in many disciplines, such as
engineering and management, this model will be a valuable alternative to other
classical techniques found in calculus and operations research.

In this section, we will discuss the basics of the Hopfield-Tank technique; in the
next section, we will study some application examples and provide a brief general
guideline to apply the model to optimization problems. We will describe the model in
a one-dimensional case since it is fundamental and easy to understand. We will then
extend it to a two-dimensional configuration, since for certain problems this
approach is more efficient. Extensions for higher dimensions can be done similarly.

3.4.1 One-Dimensional Layout

Imagine that n neurons, i = 1 to n, are one-dimensionally laid out. Weight wij is
associated with the edge from neuron i to neuron j. Assume the symmetric property
for the weights, wij = wji.


Computational procedures
Basic equations
The equations of motion are given as follows:

\frac{du_i}{dt} = -u_i + \sum_{j=1}^{n} w_{ij} V_j + I_i                    for i = 1, n    (1)

u_i(t+1) = u_i(t) + \frac{du_i}{dt} \cdot \Delta t                           for i = 1, n    (2)

V_j = g(u_j) = \frac{1}{2} \left\{ 1 + \tanh \frac{u_j}{u_0} \right\}        for j = 1, n    (3)

The energy function is defined as follows:

E = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} V_i V_j - \sum_{i=1}^{n} V_i I_i    (4)

In the above, ui is the net input, Vi is the output, and Ii is the external input for neuron
i; wij, Ii, for i, j = 1, n, and u0 are constant values determined appropriately by the
problem characteristics. In equation (2), ui(t) represents ui at time step t, and Δt is a
small constant representing a time increment on each step. Equation (3) for Vj is a
sigmoid function between 0 and 1 (Fig. 3.3). Note that the smaller the value of u0, the
steeper the slope of the graph. Recall that tanh x = (ex - e-x)/(ex + e-x). The problem is

to determine Vi for i = 1, n in such a way that energy E in equation (4) becomes
minimum.

As mentioned in the previous section, the term "energy" should be interpreted in
a broader sense rather than as physical energy. For example, in optimization problems,
our energy can be a cost, time, or distance to be minimized. Similarly, by the
equations of motion, we might imagine physical kinetic equations for a hard ball
rolling on a frictionless hard surface. Rather, these equations should be interpreted
in a much broader sense in applications; the basic idea is that these equations force
our solution in the desired direction.

Fig. 3.3. A sigmoid function for V in terms of u.

Iteration process to determine Vi for i = 1, n

Initialization

Step 0. Choose ui at t = 0 (may be denoted as ui(0)) at random; for example,

ui(0) = u0 + δui

where δui is a random number chosen uniformly in the interval of -0.1
u0 ≤ δui ≤ 0.1 u0. Also, choose a large positive number as a dummy
initial value of E.

Iteration. Repeat the following steps until E stops decreasing.

Step 1. Compute Vj from uj by equation (3) for j = 1, n.

Step 2. Compute E using equation (4).

Step 3. Compare the current E(t) with E(t-1), the E value one iteration before.
(i) If E(t) ≥ E(t-1), stop iteration; Vj, for j = 1, n are the solutions.

(ii) Otherwise continue (go to Step 4).

Step 4. Using equation (1) compute dui/dt for i = 1, n. Using equation (2),
compute ui for i = 1, n for the next iteration step. (Go to Step 1.)
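
The loop above maps directly onto code. The following Python sketch (an illustrative skeleton only; w, I, u0, and Δt would be supplied by the specific problem, and the default values here are assumptions) implements Steps 0 through 4 for the one-dimensional layout.

import math, random

def hopfield_tank(w, I, u0=0.1, dt=0.01, max_iters=10000):
    """Iterate equations (1)-(4) until the energy E stops decreasing."""
    n = len(I)
    u = [u0 + random.uniform(-0.1 * u0, 0.1 * u0) for _ in range(n)]   # Step 0
    prev_E = float("inf")
    for _ in range(max_iters):
        V = [0.5 * (1.0 + math.tanh(u[j] / u0)) for j in range(n)]     # Step 1, eq. (3)
        E = (-0.5 * sum(w[i][j] * V[i] * V[j] for i in range(n) for j in range(n))
             - sum(V[i] * I[i] for i in range(n)))                     # Step 2, eq. (4)
        if E >= prev_E:                                                 # Step 3
            return V
        prev_E = E
        dudt = [-u[i] + sum(w[i][j] * V[j] for j in range(n)) + I[i]   # Step 4, eq. (1)
                for i in range(n)]
        u = [u[i] + dudt[i] * dt for i in range(n)]                     # eq. (2)
    return V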

3.4.2 Two-Dimensional Layout

There are many interesting application problems that can be associated with an n ×
n square matrix. For these problems, representing the previously discussed variables
and expressions in two-dimensional form is convenient. For easy understanding, we
discuss one- and two-dimensional configurations in two steps. The major difference
in two-dimensional configuration is the way the variable subscripts are represented.
The variables with a single subscript in the one-dimensional case will be replaced
with variables with double subscripts.

An alternative method to the two-dimensional approach covered in this subsection
is to "spread out" the two-dimensional elements into one-dimensional arrangement
and use the previous procedures. For an n × n two-dimensional array, the spread-out
subscripts will be: 1, 2, ..., n, n + 1, ..., n². The idea can be extended to
higher dimensions.


Imagine that n² neurons are laid out two-dimensionally, in the same configuration
as the elements in an n × n square matrix. The neurons are identified by row and
column numbers; Neuron ik is the i-th row, k-th column unit. The entire set of
neurons can be represented as Neuron ik, for i = 1, n and k = 1, n. The weight
associated with the edge from neuron ik to neuron jℓ is denoted as wik,jℓ. We can
obtain the basic equations for two-dimensional layout by replacing the subscripts and
summations in the one-dimensional case as: i by ik, j by jℓ, Σi by Σi Σk, and Σj by Σj
Σℓ.

Computational procedures

Basic equations

The equations of motion are given as follows:

\frac{du_{ik}}{dt} = -u_{ik} + \sum_{j} \sum_{\ell} w_{ik,j\ell} V_{j\ell} + I_{ik}        for i = 1, n and k = 1, n    (1')

u_{ik}(t+1) = u_{ik}(t) + \frac{du_{ik}}{dt} \cdot \Delta t                                for i = 1, n and k = 1, n    (2')

V_{j\ell} = g(u_{j\ell}) = \frac{1}{2} \left\{ 1 + \tanh \frac{u_{j\ell}}{u_0} \right\}     for j = 1, n and ℓ = 1, n    (3')

The energy function is defined as:

E = -\frac{1}{2} \sum_{i} \sum_{k} \sum_{j} \sum_{\ell} w_{ik,j\ell} V_{ik} V_{j\ell} - \sum_{i} \sum_{k} V_{ik} I_{ik}    (4')

The iteration process is the same as before except that n² (instead of n) values of uik
and Vik are computed.

3.5 The Hopfield-Tank Model for Optimization Problems:
Applications

3.5.1 The n-Queen Problem

We now consider a simple example to illustrate how the basic equations are set up
and how iterations proceed for a specific problem. Our simple example is the n-queen
problem, which happens to be NP-complete (which means computationally hard).

Probably most of us are already familiar with the n-queen problem. On an n × n
chessboard, a queen can move any number of squares up, down, to the right, to the
left, and diagonally. Fig. 3.4 (a) illustrates a possible move for a queen for a case of
n = 5. The problem is to determine n "safe" positions for n queens. A safe position
means that no queen can move to a square occupied by another queen in a single
move. Obviously there must be one and only one queen in each row and each
column. (If two or more queens were in a row or column, one would be attacked by
another; if some row or column had no queen, then some other row or column would
have to contain more than one queen.) Similarly, there must be at most one queen in
each diagonal direction (a diagonal may contain no queen at all). Fig. 3.4 (b)
is a solution for n = 5; generally, a solution is not unique for a specific value of n.

(a) (b)

Fig. 3.4. The n-queen problem illustration for n = 5. (a) A possible move of a queen. (b) A
solution.

Formulation of the problem
Conditions to be satisfied
1. Exactly one queen in each row.
2. Exactly one queen in each column.
3. At most one queen in each upward diagonal (from left-bottom to right-top) (Fig. 3.5(a)).
4. At most one queen in each downward diagonal (from left-top to right-bottom) (Fig. 3.5(b)).
Diagonal position representations
Given a square position (i, j) on the chessboard, its diagonal positions can be
represented as follows (Figs. 3.4 and 3.5).
Upward diagonal: (i+k, j+k), where k = max [1-i, 1-j] to min [n-i, n-j].
Downward diagonal: (i+k, j-k), where k = max [1-i, j-n] to min [n-i, j-1].


(a) (b)

Fig. 3.5 (a) Upward diagonal (from left-bottom to right-top) squares. (b) Downward diagonal
(from left-top to right-bottom) squares.

The following Fig. 3.6 (a) and (b) illustrate two examples for upward diagonal
positions.

Fig. 3.6. Two examples of upward diagonal positions.

In Fig. 3.6 (a), (i, j) is (4, 2), and k = -1, 0, and 1 for (i +k, j + k) will give the three
diagonal positions, (3, 1), (4, 2) and (5, 3), respectively. Similarly, in Fig. 3.6 (b), (i,
j) is (2, 4), and k = -1, 0, and 1 for (i + k, j + k) will represent the three diagonal
positions, (1, 3), (2, 4) and (3, 5), respectively. In general, the range of k for the
upward diagonal must satisfy (1 ≤ i + k ≤ n) and (1 ≤ j + k ≤ n) to stay inside of the
range of the chessboard. By moving i and j to the outer expressions, we have (1 - i ≤
k ≤ n - i) and (1 - j ≤ k ≤ n - j). For the lower bound of k, k must satisfy both 1 - i ≤ k
and 1 - j ≤ k, which means k must be greater than or equal to the larger of
1 - i and 1 - j. Hence, the lower bound of k is max[1 - i, 1 - j]. Similarly, the upper
bound of k is min[n - i, n - j].

For downward diagonal, k must satisfy (1 ≤ i + k ≤ n) and (1 ≤ j - k ≤ n) to stay
inside of the chessboard. This leads to the range of k as max [1 - i, j - n] to min [n -
i, j - 1].

Definition of Vij

Vij = 0 if there is no queen on the ij-location
Vij = 1 otherwise


Since Vij represents a solution, and our solution should give either "yes queen" or "no
queen" for each square, the above is a reasonable definition for Vij.

Basic equations

Now we define our equations in such a way as to drive our solution in a better and
better direction in terms of satisfying the conditions. The equation of motion for
duij/dt can be defined either to increase or decrease uij, which in turn either increases
or decreases Vij. To increase uij, we should make duij/dt > 0; to decrease uij, we
should make duij/dt < 0; in case we want to keep the current value of uij, we should
make duij/dt = 0. With these considerations in mind, we propose the following
equation of motion.

\frac{du_{ij}}{dt} = -\left\{ \left( \sum_{k=1}^{n} V_{ik} - 1 \right) + \left( \sum_{k=1}^{n} V_{kj} - 1 \right) + \sum_{\substack{k=\max[1-i,\,1-j] \text{ to } \min[n-i,\,n-j] \\ k \neq 0}} V_{i+k,\,j+k} + \sum_{\substack{k=\max[1-i,\,j-n] \text{ to } \min[n-i,\,j-1] \\ k \neq 0}} V_{i+k,\,j-k} \right\}

On the right-hand side of the equation, we have four terms. Let us examine the
effect of the first term, (\sum_{k=1}^{n} V_{ik} - 1), which is inside -{...}. Our definition
of Vij is: Vij = 0 if there is no queen at square ij, and Vij = 1 otherwise. Thus, the sum
\sum_{k=1}^{n} V_{ik} adds up the values of V in column i. If there is one queen in column i
(which is ideal for the column), \sum_{k=1}^{n} V_{ik} will be 1, and the first term
(\sum_{k=1}^{n} V_{ik} - 1) will be zero; i.e., there is no first-term effect on duij/dt, and no
change in uij caused by the first term. If there is no queen in column i, \sum_{k=1}^{n} V_{ik}
will be 0, and the first term (\sum_{k=1}^{n} V_{ik} - 1) will be negative, or more precisely, -1.
Then -{(\sum_{k=1}^{n} V_{ik} - 1)} will be positive, and this term will contribute to
duij/dt > 0, i.e., it will increase uij. The effect is to increase Vij, the number of queens
in this column. If there is more than one queen in column i, \sum_{k=1}^{n} V_{ik} will be
greater than 1, and the first term (\sum_{k=1}^{n} V_{ik} - 1) will be positive. The effect is
the opposite of the no-queen case. Furthermore, if there are many queens in the
column, the magnitude of the effect will be even greater. The effect is to decrease Vij,
the number of queens in this column. The second term is exactly the same as the
first term except that it is for row j.

The third and fourth terms inside -{...} correspond to the upward and downward
diagonals, respectively. The idea for these terms is somewhat similar to that of the
first and second terms. For example, for the third term, \sum_{k} V_{i+k,\,j+k} with k
ranging from max[1-i, 1-j] to min[n-i, n-j] and k ≠ 0, we add up the values of the V's
along the upward diagonal except Vij, the one in the column i, row j position. If there
are one or more queens along the diagonal line, this term will be positive, i.e.,
-{\sum_{k} V_{i+k,\,j+k}} will be negative, which acts to reduce the value of uij, i.e., to
make Vij = 0. If there are no queens along the diagonal line, this term will be zero and
has no effect on the value of uij, i.e., it keeps the current value of Vij.
The energy function can be defined as:

E = \frac{1}{2} \left\{ \sum_{i=1}^{n} \left( \sum_{k=1}^{n} V_{ik} - 1 \right)^2 + \sum_{j=1}^{n} \left( \sum_{k=1}^{n} V_{kj} - 1 \right)^2 + \sum_{i} \sum_{j} \sum_{k \neq 0} V_{i+k,\,j+k} V_{ij} + \sum_{i} \sum_{j} \sum_{k \neq 0} V_{i+k,\,j-k} V_{ij} \right\}

The sigmoid function for Vjℓ can be defined as,

V_{j\ell} = \frac{1}{2} \left\{ 1 + \tanh \frac{u_{j\ell}}{u_0} \right\}

Since our goal is to make Vjℓ = 0 or 1, we can choose a small value of u0 (e.g., 0.1) (Fig.
3.7).

Fig. 3.7. The sigmoid function Vjℓ = 1/2 {1 + tanh (ujℓ / u0)} with small u0.

Alternatively, we can choose an even simpler function for Vjℓ in terms of ujℓ as follows.

Vjℓ = 1 for ujℓ ≥ 0
Vjℓ = 0 for ujℓ < 0

Note. In this example, the equation of motion does not have -uij and Iij terms
discussed for general cases.

Termination condition of iterations


We can continue until E = 0, rather than stopping when E(n+1) ≥ E(n), to avoid possible local minima,
where (n) represents the n-th iteration.
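
The following Python sketch (my own illustrative rendering, not the author's program) puts the pieces together for the n-queen problem: it applies the equation of motion above with Δt = 1, uses the simpler 0/1 threshold function for Vij, refreshes V square by square (an asynchronous variant), and iterates until E = 0 or an iteration limit is reached. As discussed in the text, convergence to E = 0 is not guaranteed, so the run may need to be restarted with a different random initialization.

import random

def n_queens_hopfield_tank(n=5, dt=1.0, max_iters=20000):
    # u[i][j] is the internal value for square (i, j); V[i][j] = 1 means a queen there.
    u = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    V = [[1 if u[i][j] >= 0 else 0 for j in range(n)] for i in range(n)]

    def energy():
        e = 0.0
        for i in range(n):
            e += (sum(V[i]) - 1) ** 2                          # one queen in each line i
            e += (sum(V[r][i] for r in range(n)) - 1) ** 2     # one queen in each line i (other direction)
        for i in range(n):
            for j in range(n):
                for k in range(-(n - 1), n):
                    if k == 0:
                        continue
                    if 0 <= i + k < n and 0 <= j + k < n:      # upward diagonal
                        e += V[i + k][j + k] * V[i][j]
                    if 0 <= i + k < n and 0 <= j - k < n:      # downward diagonal
                        e += V[i + k][j - k] * V[i][j]
        return 0.5 * e

    for _ in range(max_iters):
        if energy() == 0.0:                                    # terminate at E = 0
            return V
        for i in range(n):
            for j in range(n):
                dudt = -((sum(V[i]) - 1) + (sum(V[r][j] for r in range(n)) - 1))
                for k in range(-(n - 1), n):
                    if k == 0:
                        continue
                    if 0 <= i + k < n and 0 <= j + k < n:
                        dudt -= V[i + k][j + k]
                    if 0 <= i + k < n and 0 <= j - k < n:
                        dudt -= V[i + k][j - k]
                u[i][j] += dudt * dt
                V[i][j] = 1 if u[i][j] >= 0 else 0             # refresh V immediately
    return V        # if E never reached 0, the (possibly invalid) board is returned

board = n_queens_hopfield_tank(5)
for row in board:
    print(row)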

3.5.2 A General Guideline to Apply the Hopfield-Tank Model to
Optimization Problems

After studying a specific application of the Hopfield-Tank model to the n-queen
problem, we are now in a better position to understand how to apply the model to
other optimization problems. Here is a guideline.

• Define Vij so that they can represent solutions of the problem. (e.g., Vij = 0 means
no queen, 1 means yes queen)

• Define an equation of motion as:

\frac{du_{ij}}{dt} = -\{ \text{function of } V_{ij} \}

in such a way that uij tends to decrease when Vij is supposed to decrease, and
conversely, uij tends to increase when Vij is supposed to increase.

• Values of wij are fixed by the equation of motion.

• Values of Vij change through changes of uij. In turn, the changes of uij are affected
by the changes of Vij. The changes of Vij are much slower than those of uij, because
of commonly used activation function forms. For example, in Fig. 3.7, the value of

V stays the same as either 0 or 1 for most values of u. That is, uij changes frequently;
Vij changes occasionally after significant changes of uij.
• Classical techniques typically use differential equations in terms of Vij themselves,
e.g., dVij/dt = ..., while the Hopfield-Tank model uses uij as an intermediate stepping
stone to solve for Vij. Biological systems use such "indirect control".

• The energy function can be derived based on the fundamental principle for the
relationship between the equation of motion and the energy:

\frac{du_{ij}}{dt} = -\frac{\partial E}{\partial V_{ij}}, \quad \text{i.e.,} \quad E = -\int \frac{du_{ij}}{dt} \, dV_{ij}

In many applications, it may be easier to write the equation of motion first from the
given problem. Derive or conjecture the energy function in such a way that when it
is differentiated, the result gives the equation of motion. If this is successful, it will
be easier than directly integrating the equation of motion to obtain the energy
function, since differentiation is typically easier than integration.

For example, in the above n-queen problem, differentiation of the first term of
energy function leads to the corresponding first term of equation of motion as
follows.


\frac{\partial}{\partial V_{ij}} \left[ \frac{1}{2} \sum_{i=1}^{n} \left( \sum_{k=1}^{n} V_{ik} - 1 \right)^2 \right] \;\rightarrow\; -\left( \sum_{k=1}^{n} V_{ik} - 1 \right)

For example, for n = 2,

\frac{1}{2} \sum_{i=1}^{2} \left( \sum_{k=1}^{2} V_{ik} - 1 \right)^2 = \frac{1}{2} \left\{ (V_{11} + V_{12} - 1)^2 + (V_{21} + V_{22} - 1)^2 \right\}.

Then, for example,

-\frac{\partial}{\partial V_{21}} \left[ \frac{1}{2} \sum_{i=1}^{2} \left( \sum_{k=1}^{2} V_{ik} - 1 \right)^2 \right] = -\left\{ 0 + 2 \cdot \frac{1}{2} (V_{21} + V_{22} - 1) \cdot \frac{\partial V_{21}}{\partial V_{21}} \right\} = -(V_{21} + V_{22} - 1) = -\left( \sum_{k=1}^{2} V_{2k} - 1 \right)

3.5.3 Traveling Salesman Problem (TSP)

The traveling salesman problem (TSP) is an NP-complete problem, notorious for its
time complexity. For this reason, the TSP has been chosen as a popular benchmark
problem to test the effectiveness of many new techniques. Basically, earlier
techniques are based on exhaustive search, which means "try all" to find an optimal
solution. Although other popular techniques such as dynamic programming and
branch-and-bound algorithms are better than pure exhaustive search, their principles
are improvements over exhaustive search by cutting fruitless branches in the search
space.

The Hopfield-Tank model is based on a totally new idea, very easy to implement
on a parallel computer, and appears to find solutions very fast. This is why there was
so much excitement when this model was announced and demonstrated to solve the
TSP. The solutions are not guaranteed to be optimal, but they are said to be good
enough for practical applications [Hopfield and Tank, 1985].
The basic idea of applying the Hopfield-Tank model to the TSP is the same as what we
have seen earlier in this section for the n-queen problem. We have to define Vij appropriately
so that it represents our solution. We also have to define an equation of motion so that
it drives our solution in a desired direction. In the following, we will discuss the basic
idea using a simple example.

Formulation of the Hopfield-Tank model for the TSP

Problem. Given an undirected weighted graph, find a shortest tour that visits every
vertex exactly once.


We will illustrate the idea by using a specific example of n = 4 cities (Fig. 3.8). Of
course, the method can be applied to any problem with any value of n.

Fig. 3.8 A TSP example of n = 4 cities.

Define the "distance" matrix, [dij], to represent the distance between each pair of
the cities. The matrix is always symmetric, so the upper triangular matrix is sufficient.
When two cities are not directly connected, an arbitrarily large distance can be
assigned so that any solution connecting unconnected cities will disappear soon
because of the penalty. For our specific example, the distance matrix will be given as
follows:

Distance matrix

City \ City     1     2     3     4
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
1               0    25    17    15
2                     0    20    10
3                           0    30
4                                 0

One optimal solution is given as:

3 ⎯17⎯ 1 ⎯15⎯ 4 ⎯10⎯ 2 ⎯20⎯ (3)        Total distance = 62

We note that there is always another corresponding optimal solution for each optimal
solution by traversing the route in the opposite direction (in the above, 3 — 2 — 4
— 1 — (3)).

Our problem is to determine an order of cities to be visited that gives the minimum
traveling distance. To represent a solution (not necessarily optimal) which shows the
order of cities to be visited, we consider an n × n "solution" matrix. For our example,
if a solution is to visit Cities No. 3, 1, 4, 2 (and 3) in this order, then the 4 × 4 matrix
will be:

A solution matrix

City \ Position     1    2    3    4
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
1                   0    1    0    0
2                   0    0    0    1
3                   1    0    0    0
4                   0    0    1    0

Here entry 1 means visit; 0 means not visit. More precisely, an entry 1 in the i-th row,
j-th column means City i is the j-th city to be visited. For example, City 3 will be
visited first, City 1 will be visited second, and so on. Note that there is exactly one 1
in each row and column. We define VXi, 0 ≤ VXi ≤ 1, to represent the degree of
visiting City X at Position i (i.e., as the i-th city). If VXi = 0 then we do not visit City X as the i-th
city; if VXi = 1 then visit; if VXi is between 0 and 1, say 0.3, then visit with degree 0.3
(although in real life, you cannot do this). For example, during iteration processes,
our solution matrix can look as follows:

City X \ Position i     1      2      3      4
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
1                      0.1    0.8    0      0.1
2                      0.2    0      0.2    0.8
3                      0.7    0.1    0      0.2
4                      0.3    0.4    0.9    0.2

The basic equations can be defined as:

\frac{du_{Xi}}{dt} = -\frac{u_{Xi}}{\tau} - A \sum_{j \neq i} V_{Xj} - B \sum_{Y \neq X} V_{Yi} - C \left( \sum_{Y} \sum_{j} V_{Yj} - n \right) - D \sum_{Y} d_{XY} \left( V_{Y,\,i+1} + V_{Y,\,i-1} \right).

V_{Xi} = g(u_{Xi}) = \frac{1}{2} \left\{ 1 + \tanh \frac{u_{Xi}}{u_0} \right\}.

E = \frac{A}{2} \sum_{X} \sum_{i} \sum_{j \neq i} V_{Xi} V_{Xj} + \frac{B}{2} \sum_{i} \sum_{X} \sum_{Y \neq X} V_{Xi} V_{Yi} + \frac{C}{2} \left( \sum_{X} \sum_{i} V_{Xi} - n \right)^2 + \frac{D}{2} \sum_{X} \sum_{Y \neq X} \sum_{i} d_{XY} V_{Xi} \left( V_{Y,\,i+1} + V_{Y,\,i-1} \right)

The coefficients A, B, C, and D, and u0 are constants (for example, A = B = D = 500
and C = 200). We have five terms on the right-hand side of the first equation of
motion. The first term is called the biological term. The value of τ can be set to 1
without loss of generality (see [Hopfield and Tank, 1985]). The second term drives
the network to have only one 1 in Row X (if the other elements are 0, then keep the
current value of uXi; otherwise, reduce uXi so that VXi becomes 0). The third term is
the same as the second except that it is for Column i. The fourth term is for having
exactly n 1's in the entire solution matrix.

The last term is to minimize the total distance to be traveled by the solution. If VXi
and VY, i+1 are close to 1, the degree of visiting these two cities in sequence (ith visit
for X and (i+1)st for Y) is high. If dXY, the connecting distance between cities X and
Y, is also large, it is better to get rid of VXi by reducing uXi. Similarly, VY, i-1 is for
(i-1)st visit for Y and ith visit for X. In our example, suppose that X = 1 and i = 2; then

\sum_{Y} d_{1Y} \left( V_{Y3} + V_{Y1} \right) = d_{12} \left( V_{23} + V_{21} \right) + d_{13} \left( V_{33} + V_{31} \right) + d_{14} \left( V_{43} + V_{41} \right).

For example, if VXi = V12 and VY, i+1 = V23, and if both are close to 1, then the degree
of visiting these two cities in sequence is high: X = 1 for the 2nd city and Y = 2 for the
3rd city. Since dXY = d12 = 25 is large, we would try to remove VXi = V12.
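
As an illustration, the right-hand side of the equation of motion can be coded directly. This is a sketch only: it assumes the full symmetric form of the distance matrix with a zero diagonal, wraps position n + 1 around to position 1 (since a tour is a cycle, the city after the last position is the first one), and borrows the example coefficient values mentioned above.

def du_dt_tsp(u, V, d, A=500.0, B=500.0, C=200.0, D=500.0, tau=1.0):
    """Right-hand side of the TSP equation of motion for every (X, i)."""
    n = len(V)
    total = sum(V[Y][j] for Y in range(n) for j in range(n))
    dudt = [[0.0] * n for _ in range(n)]
    for X in range(n):
        for i in range(n):
            row_term = sum(V[X][j] for j in range(n) if j != i)        # one 1 per row
            col_term = sum(V[Y][i] for Y in range(n) if Y != X)        # one 1 per column
            dist_term = sum(d[X][Y] * (V[Y][(i + 1) % n] + V[Y][(i - 1) % n])
                            for Y in range(n))                          # tour length
            dudt[X][i] = (-u[X][i] / tau - A * row_term - B * col_term
                          - C * (total - n) - D * dist_term)
    return dudt

Each u[X][i] is then stepped by u[X][i] += dudt[X][i] * dt and the sigmoid is applied to obtain the new V, exactly as in the one-dimensional iteration loop given earlier.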

3.6 The Kohonen Model

The Kohonen neural network model possesses interesting characteristics such as
self-organization and competitive learning. The major objective in this section is to
study these characteristics in the Kohonen network.

Background

One interesting aspect in the study of neural networks is how a neural network learns.
The backpropagation model, discussed in the previous chapter, performs supervised
learning. For each input pattern, the neural network computes its output. At the same
time, the neural network is given correct output by the teacher, compares it with its
own output, and tries to learn how to make its output closer to the correct output. This
supervised learning process is like a private lesson with a tutor. Each time the student
comes up with an output (for example, pronouncing a word or carrying out an
arithmetic computation), the tutor immediately gives the correct answer.

A variation of this supervised learning is called graded or reinforcement
learning. In this method, the neural network is not given correct output
corresponding to each input. Instead, the neural network is occasionally given a
"grade" or "score" for overall performance for its outputs since the last time it was
graded. Graded learning is analogous to a classroom situation, where students are
occasionally given quizzes and exams, and the resulting scores reflect their overall
performance.

When we think of the way humans learn, the above-mentioned supervised and
graded learning are certainly important. But human learning is not limited only to
these forms and involves much more. For example, a baby gains a tremendous
amount of knowledge early on, such as how Mom, Dad and other objects around the
baby look, sound, smell and feel. Obviously the baby does not learn this by being told
what is correct and what is not. In other words, humans have the ability to learn
without being supervised or graded. It will be interesting to model such human
learning in neural networks, as, for example, in the form of unsupervised learning
(that is, neither supervised nor graded). The term unsupervised learning refers to a
method in which the neural network can learn by itself without external information
during its learning process.

Self-organization is an unsupervised learning method, where the neural network
organizes itself to form useful information. This method may model some of the
human learning processes discussed above, which are neither supervised nor graded.

In competitive learning, neurons (or connecting edges) compete with each other.
The winners of the competition strengthen their weights while the losers' weights are
unchanged or weakened. The idea is somewhat similar to the principle of evolution,
which will be discussed in the next chapter on genetic algorithms; the winning
species in the evolution process survives while the losers become extinct. By
employing competitive learning, the neural network can self-organize, achieving
unsupervised learning. These terms "self-organization" and "competitive" learning
will become clearer when we study the Kohonen network in the following.

Architecture

The architecture of the Kohonen network is shown in Fig. 3.9. It is a multilayered
network of two layers; the first is the input layer and second is the output layer, called
the Kohonen layer. The neurons in the Kohonen layer are called Kohonen neurons.
Every input layer neuron is connected to every Kohonen neuron, with a variable
associated weight. The network is non-recurrent, that is, feedforward (input
information propagates only from the left to right). Continuous (rather than binary or
bipolar) input values representing patterns are presented sequentially in time through
the input layer, without specifying the desired output. Each pattern is represented in
the form of a vector, x = (x1, x2, ..., xn).


Fig. 3.9. The Kohonen network architecture.

In the above figure, the neurons in the Kohonen layer are arranged in one
dimension. They can also be arranged two-dimensionally. In both one and two-
dimensional Kohonen layer configurations, a "neighborhood parameter" or "radius,"
r can be defined to indicate the neighborhood of a specific neuron. Fig. 3.10
illustrates examples of defining radiuses for one and two-dimensional Kohonen layer
configurations. Neighbors do not wrap around cyclically from one end to the other,
i.e., missing end neurons, if any, are not considered. For two-dimensional
configurations, other shapes such as hexagons can also be defined. There are
variations for the Kohonen network other than the model described here in the
architecture and computational procedure.
Computational procedures
Initialization (Step 0).
Assign small real random values to weights, wij for i = 1 to n and j = 1 to m.
Initialize the following two parameters:

A neighborhood parameter, radius r (e.g., r = 3). (See Fig. 3.10.)
A learning rate, α, where 0 ≤ α ≤ 1 (e.g., α = 0.8).
Iterations
Repeat the following Steps 1 through 5 for a sequence of input vectors, x's, drawn at
random.
Step 1. Enter a new input vector, x = (x1, x2, ..., xn), to the input layer.


(a)

(b)

Fig. 3.10. Neighborhood examples of neuron z in Kohonen layers represented in terms of
radiuses. (a) A one-dimensional configuration. (b) A two-dimensional configuration.

The following Steps 2 and 3 perform competitive learning.

Step 2. Selection of a winning Kohonen neuron.

The Kohonen neurons compete on the basis of which of them has its
associated weight vector, wj = (w1j, w2j, ..., wnj), "closest" to x, as
measured by a "distance function," D(wj, x). Each Kohonen neuron, j
for j = 1 to m, calculates its distance as D(wj, x). There are different
choices for the functional form of D(wj, x). A common form is:

D(w_j, x) = \sum_{i=1}^{n} (w_{ij} - x_i)^2.

The winning Kohonen neuron is the one with the smallest distance.


Step 3. Weight modification.

For all neurons, j, within a specified neighborhood radius of the
winning neuron, adjust the weights according to the following formula:

wj(t+1) = wj(t) + α(x(t) - wj(t))

This weight modification moves the weights associated with the
winning neurons a fraction α of the way from wj toward x. For example,
in the extreme cases, if α = 1 then wj(t+1) will change to x(t); if α = 0 then
wj(t+1) will remain as wj(t). For a typical value of α, which is between 0
and 1, wj(t+1) will be between x(t) and wj(t).
For all the remaining (losing) neurons, the weights are unchanged, i.e.,
wj(t+1) = wj(t).

Step 4. Update the learning rate α. Typically, α is gradually reduced over
iterations.

Step 5. Slowly reduce radius r at specified iterations.

Example.

We consider a simple scenario of a one-dimensional Kohonen neural network with
three input layer neurons x1, x2, x3 and 10 Kohonen layer neurons y1, y2, ..., y10. We
interpret each input vector x = (x1, x2, x3) as representing a color: (1, 0, 0) = red, (0,
1, 0) = yellow, (0, 0, 1) = blue, (1, 1, 0) = orange, etc. We may have a sequence of,
say, 20 input vectors.

Input vector No. 17 may be randomly picked first, representing red, x = (x1, x2,
x3) = (1, 0, 0). We will determine the winning Kohonen neuron, i.e., yj that “best
represents” this x. To do so, we compute 10 associated distances corresponding to the
10 Kohonen neurons and find yj that gives the minimum distance. Perhaps y6 is the
winner; the associated weight vector (w16 , w26 , w36) = (0.9, 0.1, 0.2) is closest to this
input vector, x = (x1, x2, x3) = (1, 0, 0), and its associated distance is:

Distance = \sum_{i=1}^{3} (w_{i6} - x_i)^2 = (0.9 - 1)^2 + (0.1 - 0)^2 + (0.2 - 0)^2 = 0.06.

Next, we adjust the weights for neighboring ys; e.g., y5, y6, y7 if r = 1. The new
weights will be between the current weights and x. Assuming α is 0.8, the new
weights associated with y6 will be:

w16(new) = w16 + α(x1 - w16) = 0.9 + 0.8 * (1 - 0.9) = 0.9 + 0.08 = 0.98.
w26(new) = w26 + α(x2 - w26) = 0.1 + 0.8 * (0 - 0.1) = 0.1 - 0.08 = 0.02.
w36(new) = ......

We notice, for example, that the new weight w16 = 0.98 is between the current weight
w16 = 0.9 and x1 = 1. The new weights associated with y5 and y7 will also be computed
similarly.


The second input vector may be No. 12 and represent blue, x = (0, 0, 1). Perhaps
y3 is the winner. The third input vector may represent red, x = (1, 0, 0). …… After 20
input vectors (an epoch), reduce α from 0.8 to e.g., 0.8 * 0.9 = 0.72. After more
epochs, the weight vectors converge, i.e., the neural network self-learns by clustering
the three colors. If input is red, y6 fires; if input is blue, y3 fires; and so on.
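
A compact Python sketch of this color-clustering scenario follows. It is an illustration under simplified assumptions: the 20 input vectors are drawn at random from the three pure colors, the radius r is kept at 1 throughout, and the learning rate is reduced by a factor of 0.9 after each epoch; which neuron ends up representing which color will differ from run to run.

import random

n_inputs, n_kohonen = 3, 10
# Small random initial weights: one 3-component weight vector per Kohonen neuron.
w = [[random.uniform(0.0, 0.3) for _ in range(n_inputs)] for _ in range(n_kohonen)]
alpha, r = 0.8, 1

colors = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]              # red, yellow, blue as in the text
training = [random.choice(colors) for _ in range(20)]    # an epoch of 20 input vectors

for epoch in range(50):
    for x in training:
        # Step 2: the winner is the neuron whose weight vector is closest to x.
        dist = [sum((w[j][i] - x[i]) ** 2 for i in range(n_inputs))
                for j in range(n_kohonen)]
        winner = dist.index(min(dist))
        # Step 3: move the weights of the winner and its neighbors toward x.
        for j in range(max(0, winner - r), min(n_kohonen, winner + r + 1)):
            for i in range(n_inputs):
                w[j][i] += alpha * (x[i] - w[j][i])
    alpha *= 0.9    # Step 4: reduce the learning rate (r could also be reduced, Step 5)

After enough epochs the weight vectors of the winning neurons settle near the three color vectors, so each color consistently fires its own Kohonen neuron.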

What does the Kohonen network accomplish?

The network stores the presented input vector patterns through modification of
weight vectors. After enough input vectors are presented, the weight vectors become
densest where input patterns are most common, and become least dense where input
patterns are rare. That is, the effect is to cluster (or categorize) the input patterns. The
density distribution of the weight vectors tends to approximate the density
distribution of the input vectors. In addition, similar input patterns will be classified
in the same cluster and will fire the same output neurons. The input patterns are
stored and classes are found by the network itself without a teacher or ideal output
patterns; this is the idea of self-organization and unsupervised learning.

Application examples

A neural phonetic typewriter

Kohonen [1988] shows a speaker-adaptive speech recognizer using a Kohonen
neural network. Spoken words are presented to the neural network through a
microphone with some pre-processing. The output neurons are labeled with
phonemes. Basically, the network trains itself to map the input to output.

Data compression as a vector quantizer

A Kohonen network can be used to compress and quantize data, such as for speech
and images, before storage or transmission to reduce the amount of information to be
stored or sent. The principle is to categorize a given set of input patterns into classes,
and represent any pattern by the class into which it fits. Generally, a neural network
learning process of this type, to divide a set of input patterns into disjoint classes, is
called learning vector quantization.

3.7 Simulated Annealing

Simulated annealing is a general technique for optimization problems in many
application domains. Our interest in this chapter is its application to neural networks,
particularly to Hopfield networks, but simulated annealing can be employed in many
other areas. Loosely speaking, Boltzmann machines, which will be discussed in the
next section, are extensions of Hopfield networks in which the simulated annealing
technique is incorporated to achieve optimization.

What is simulated annealing and why do we use it?


When we solve a hard optimization problem by an iterative process, the solution may
get stuck prematurely in an undesirable local minimum before reaching the global
minimum (or an acceptable local minimum), as illustrated in Fig. 2.10. Simulated
annealing is a technique to avoid undesirable local minima. In general, many iterative
optimization techniques may determine a solution that minimizes or maximizes a
certain objective function of a system, such as total error, energy, cost, or profit.
Minimization and maximization are technically identical, i.e., exactly the same
technique can be used – to minimize we go down and to maximize we go up.

The basic idea of employing iterative techniques is as follows. If we can find a
solution that minimizes the objective function in one step, that would be the best. In
order to find an x that minimizes the function f(x) in calculus, we set the derivative
f'(x) = 0. This is a one-step procedure. For many difficult problems, however, this
one-step approach does not work; in such cases, we employ an iterative procedure,
such as that discussed earlier for the backpropagation model and Hopfield networks.
Typically, we start with an arbitrary, often randomly selected, solution and improve
the solution incrementally through many iterations. In each step f(x) decreases
slightly, and hopefully f(x) reaches the absolute bottom. A common problem of this
technique is that the solution may prematurely terminate upon reaching a local basin
that is located at a high elevation, and we may mistakenly conclude that this is the
best solution.

Simulated annealing tries to overcome this local minima problem by incorporating
probabilistic, rather than strictly deterministic, approaches in search of optimal
solutions. The term “probabilistic” is also called “statistical” or “stochastic,” all of
which describe the same concept in this context. The name "simulated annealing" is
used since it is analogous to a gradual cooling or annealing process of a metal or
another substance to its lowest energy level, resulting in a crystal. If the metal is
heated to a high temperature, it melts. If the metal is cooled quickly (quenching),
atoms or molecules bind together without reaching the lowest binding energy levels,
leading to an amorphous or defective state. This is analogous to an iterative process
that is trapped in an undesirable local minimum. We want to produce a crystal state,
a global minimum, by annealing. Therefore, we will set the system temperature high
at the beginning, gradually reducing the temperature, making certain that the system
is near thermal equilibrium at each temperature.

The algorithm discussed here is analogous to physical phenomena, especially
statistical mechanics. Energy E is a measure that is used to determine whether a
minimum (or maximum) solution has been reached. E can represent total error, cost,
profit, etc., depending on the specific application. Typically, E does not represent
physical energy. Similarly, temperature T is another measure often called
pseudo-temperature, a parameter to perform simulated annealing computer
algorithms. That is, T is analogous to temperature in thermodynamics, but typically
it is not physical.

Probabilities of states in terms of the energy and the temperature

We briefly overview the statistical mechanics aspect from which simulated annealing
is conceived. In thermodynamics, the probability of finding the system in a particular
state with energy E and temperature T is proportional to the Boltzmann probability
factor:


e^{-E/kT}

where k is the Boltzmann constant, 1.3807 × 10^-23 joule/kelvin, and T is a
measurement in kelvin = ºC + 273.15. Consider two states S1 and S2, with energies E1
and E2, and the same temperature T. The ratio of the probabilities of the two states is
as follows:

\frac{P(S_1)}{P(S_2)} = \frac{\exp(-E_1/kT)}{\exp(-E_2/kT)} = \exp\left( -\frac{E_1 - E_2}{kT} \right).

For example, a molecule of gas in the earth's atmosphere has its lowest energy at the
sea level of 0 meters and higher energy at a higher altitude. But the probability of
finding the molecule 10 meters above sea level is about the same as at the sea level,
because the energy difference [E1 - E2] is very small in comparison with kT, i.e., the
probability ratio of these two levels is exp(-[E1 - E2]/kT) ≈ exp(- 0/kT) ≈ 1. But when
the altitude becomes much higher, say, 10,000 meters, the probability difference is
significant. At such a high altitude, significantly fewer molecules exist. The
probability of a molecule, or, equivalently, the number of molecules, decreases
exponentially with the altitude.

Now, as a hypothetical scenario, assume that the temperature is increased 1,000
times. That is, T in exp(-[E1 - E2]/kT) is 1,000 times larger. Under such circumstances,
the probability of finding the molecule would be much higher, even at a high altitude;
the probability at 10,000 meters would then be about the same as the probability at
10 meters at the original temperature. Accordingly, when the temperature is high, the
system explores a large number of possible states, ranging from low to high altitudes.

Simulated annealing adopts this thermodynamics concept. We do not need to keep
the Boltzmann constant, k, or to measure E in joules and temperature in kelvin, since
we are not dealing with physical systems. We can drop k by selecting an appropriate
ratio between E and T. We start with a high temperature so that a large number of
possible solutions can be explored at an early stage. We lower the temperature
gradually, as during metal annealing, ensuring a low-energy solution at each
temperature.

Simulated annealing algorithm
__________________________________________________________________

Algorithm

Step 0. Initialization.
Randomly select a solution vector x. Set the temperature parameter T to
T0. We may select T0 large enough in comparison with a representative
| ΔE | so that e-ΔE/T is sufficiently close to 1. ΔE is defined in the next
Step.


Step 1. Beginning of the outer and inner loops.
Compute xp, a perturbed (slightly changed) solution of x.
Determine ΔE = E(xp) – E(x), the change in the energy (objective)
function.

Step 2. Select the current x (i.e., no change) or xp (i.e., the perturbed new
solution) based on the following criteria for a new x of the next time

step.
Case 1. ΔE < 0, i.e., xp is better than x. Select xp.
Case 2. ΔE ≥ 0, i.e., xp is not better than x. Select xp with the
probability of e-ΔE/T, and x with the probability of 1 - e-ΔE/T. We can

implement Case 2 by picking up a random number r on [0, 1], then
selecting xp if r < e-ΔE/T, and selecting x otherwise.

Step 3. Repeat Steps 1 and 2 until | ΔE | becomes small enough, i.e., the system
is near equilibrium at this temperature. (Alternatively, repeat until the
number of iterations exceeds a predetermined maximum number).

Step 4. Reduce the temperature T and repeat Steps 1 through 3 until T reaches
zero or a small positive number. Possible schemes for reducing the
temperature will be discussed later.

__________________________________________________________________

In the above algorithm, Step 2, Case 2 is a key to simulated annealing. This action
forces the relative probabilities of the two states of xp and x, that differ in energy by
(E, to match the Boltzmann distribution [Kirkpatrick, 1988]. The idea of the process
is that even if (E ( 0, i.e., the new solution is the same or worse than the current one,
we still select the new solution with a certain probability. This helps to escape from
local minima by not focusing solely on downward movement. The probability of this
selection is high when temperature T is large, since e-ΔE/T ≈ e-ΔE/∞ ≈ e0 ≈ 1. The
probability becomes smaller when T gradually decreases, and for T → 0, the
probability e-ΔE/T = 1/eΔE/T ≈ 1/e∞ → 0. (When both T and | ΔE | are exactly zero, e-ΔE/T
is undefined. We can avoid this extreme case by stopping the algorithm when (E and
T are sufficiently small.) The effect of this probability change is that, starting at a
high temperature, we try to jump out of local minima more aggressively during an
early stage. Later, presumably most, if not all, local minima have been escaped, and
when the system is close to a global minimum, we perform a more gentle minimizing
process.
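
To make the flow of Steps 0 through 4 concrete, the following is a minimal Python sketch of the algorithm. It is only an illustration: the caller supplies the problem-specific energy and perturb functions, the parameter values are arbitrary examples, and a fixed inner-loop count stands in for the equilibrium test of Step 3.

__________________________________________________________________

import math
import random

def simulated_annealing(x0, energy, perturb, T0=1.0, alpha=0.9,
                        T_min=1e-3, max_inner=1000):
    """Minimal simulated annealing loop following Steps 0-4.

    energy(x)  -> objective value E(x) to be minimized
    perturb(x) -> slightly changed copy of x
    """
    x = x0                      # Step 0: initial solution supplied by caller
    T = T0                      # Step 0: initial temperature
    while T > T_min:            # Step 4: outer loop over temperatures
        for _ in range(max_inner):          # Steps 1-3: inner loop at fixed T
            xp = perturb(x)                 # Step 1: perturbed solution
            dE = energy(xp) - energy(x)     # Step 1: change in energy
            if dE < 0:                      # Step 2, Case 1: always accept improvement
                x = xp
            elif random.random() < math.exp(-dE / T):
                x = xp                      # Step 2, Case 2: accept a worse solution
                                            # with probability e^(-dE/T)
        T *= alpha              # Step 4: geometric cooling, e.g., alpha = 0.9
    return x

__________________________________________________________________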

Example. Traveling salesman problem (TSP)

N = 5 cities are randomly scattered within a 1.0 × 1.0 square area, and numbered 1
through 5 (Fig. 3.11.) Each solution vector x is a permutation of N numbers. E(x) is
the total distance for solution x.


Figure 3.11. A randomly selected solution example of a traveling salesman problem of five
cities. The solution vector x = (1, 3, 2, 5, 4).

Application of the algorithm

Step 0. Randomly select a solution vector x, such as (1, 3, 2, 5, 4), for the 0th
        iteration. Set T to T0.

Step 1. Compute xp, a perturbed solution of x. For example, xp may be obtained
        by randomly swapping two cities in x. Perhaps it is (1, 3, 4, 5, 2) for the
        first iteration. Determine ΔE = E(xp) – E(x), the change in the total
        distance.

Step 2. Case 1. If ΔE < 0, i.e., xp is a better solution than x, select xp as the new x
        for the next step.
        Case 2. If ΔE ≥ 0, select xp with probability e-ΔE/T, and keep the current x
        with probability 1 - e-ΔE/T.

Step 3. Repeat Steps 1 and 2 until |ΔE| is small enough.

Step 4. Reduce T by, for example, (new T) = 0.9 × (current T). Repeat Steps 1
        through 3. Terminate the entire algorithm when T reaches zero or a small
        number.

My graduate student Bob Crichton performed numerical experiments on TSP, by
selecting various values of parameters such as the number of cities (5, 10, 15, 20),
initial temperature T0 (e.g., 0.02, 0.05), |ΔE| min in Step 3 (e.g., 0.1, 0.21, 0.42), and
the coefficient of the temperature reduction formula in Step 4 (e.g., 0.9, 0.95, 0.99).
He also experimented with a slightly different way of obtaining a perturbed solution
in Step 1 – randomly picking two cities then reversing the order of the cities between
them including the two cities. For example, if Cities 3 and 5 are picked in 6-city
solution (1, 3, 4, 6, 5, 2), then the perturbed solution will be (1, 5, 6, 4, 3, 2). This
approach often yields more moderate changes in the total distance than the one
described in Step 1. This is because the former involves changing two distances (in
the above example, the distances between 1 and 3, and 5 and 2), whereas Step 1
involves changing four distances.
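
The two ways of generating a perturbed solution mentioned above can be sketched as follows. This is an illustrative sketch; the tour representation (a Python list of city numbers) and the function names are assumptions, not taken from the experiments described.

__________________________________________________________________

import random

def swap_two_cities(tour):
    """Perturbation used in Step 1: randomly swap two cities.
    Changes up to four edge lengths of the tour."""
    i, j = random.sample(range(len(tour)), 2)
    new_tour = tour[:]
    new_tour[i], new_tour[j] = new_tour[j], new_tour[i]
    return new_tour

def reverse_segment(tour):
    """Alternative perturbation: reverse the segment between two randomly
    picked cities (inclusive). Typically changes only two edge lengths."""
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

# Example: the 6-city case from the text - reversing the segment from
# city 3 to city 5 in (1, 3, 4, 6, 5, 2)
tour = [1, 3, 4, 6, 5, 2]
i, j = tour.index(3), tour.index(5)
print(tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:])   # [1, 5, 6, 4, 3, 2]

__________________________________________________________________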

Fig. 3.12 shows a typical run to obtain an optimal solution for 10 cities with T0 =
0.021, |ΔE| min = 0.021, and (new T) = 0.9 × (current T), employing the Step 1 method
for obtaining a perturbed solution. This run required a few seconds of runtime on a 2 GHz PC.


It is interesting to observe the energy changes in Fig. 3.12 (a) where local minima
were escaped due to the simulated annealing effect. When we select too small a
coefficient value in Step 4 (e.g., 0.8), temperature decreases rapidly, and
consequently the simulated annealing effect declines quickly and we may not be able
to reach an optimal solution.


Fig. 3.12. Simulated annealing experiment on TSP with 10 cities. (a) Initial solution.
(b) Optimal solution. (c) Total energy (distance) and temperature changes over iterations.

Considerations on reducing the temperature

In Step 4 of the algorithm, we reduce the temperature T, that is, we perform annealing
(cooling). Various schemes have been proposed for this process. There is a tradeoff
between the cooling speed and finding the optimal solution. That is, if the cooling
speed is very slow, the global minimum may be guaranteed, but it may require much,


often impractically long, computation time. The following is such a very slow
cooling scheme [Geman and Geman, 1984]:

Tt = T0/log (1 + t) t =1, 2, …

where t represents the tth iterate of the outer loop.
On the other hand, if the cooling speed is faster, the computation time will be shorter,
but finding the global minimum may not be guaranteed. Even so, often near-optimal
solutions are practically good enough. The following is such a scheme used in Step
4 of Example (TSP) above [Kirkpatrick, 1988]:

Tt = αTt-1
where the reducing factor α, 0< α<1, is typically 0.8≤ α ≤0.99.
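
As a small illustration, the two cooling schedules above might be coded as follows; this is a sketch, and the starting temperature shown is an arbitrary example.

__________________________________________________________________

import math

def logarithmic_cooling(T0, t):
    """Very slow schedule of Geman and Geman: T_t = T0 / log(1 + t), t = 1, 2, ..."""
    return T0 / math.log(1 + t)

def geometric_cooling(T_prev, alpha=0.9):
    """Faster schedule used in the TSP example: T_t = alpha * T_(t-1), 0 < alpha < 1."""
    return alpha * T_prev

# First few temperatures under each scheme, starting from T0 = 1.0
T0 = 1.0
print([round(logarithmic_cooling(T0, t), 3) for t in range(1, 6)])
T, geo = T0, []
for _ in range(5):
    T = geometric_cooling(T, alpha=0.9)
    geo.append(round(T, 3))
print(geo)

__________________________________________________________________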

3.8 Boltzmann Machines

3.8.1 An Overview

A Boltzmann machine can be defined as an extension of the Hopfield network. One
major distinction between these two models is the ways the states of the neurons are
updated. In the Hopfield network, the neurons are updated according to a
deterministic formula (xi(t + 1) = 1 if neti(t) > θi, …, given in Section 3.3). Note that
the term deterministic refers to the equations to change the states of neurons, rather
than how the neurons are selected for the update. Similarly, ui for the Hopfield-Tank
model is evaluated in a deterministic manner. In the Boltzmann machine, the states
of the neurons are updated in a stochastic way using simulated annealing. The
probabilities in simulated annealing are given by the Boltzmann distribution in
statistical mechanics; hence the name Boltzmann machine. Additional major
distinctions between the Boltzmann machine and the Hopfield network are: the types
of learning and the existence of hidden neurons [Ackley, et al, 1985].

The three major differences between the Boltzmann machine and the Hopfield
network can be summarized as follows (more explanations below):

1. Neuron update. The Boltzmann machine updates neurons stochastically using
simulated annealing, while the Hopfield network updates them deterministically.

2. Learning forms. The Boltzmann machine can be either supervised or
unsupervised. The Hopfield network has only one learning form.

3. Hidden neurons. The Boltzmann machine has hidden neurons, but the Hopfield
network does not.

The Boltzmann machines can be classified into two types based on the ways the
networks learn – supervised and unsupervised (sometimes called self-supervised).
The supervised version is similar to the backpropagation model, but requires much
more computation time. For this reason the Boltzmann machine for supervised


learning is not extensively employed in practice and the details will not be discussed
in this book.

Another aspect of Boltzmann machines is how the neurons are grouped. In the
unsupervised model, the neurons are divided into two groups – visible and
hidden. In the supervised model, the neurons are also divided into two groups,
visible and hidden; furthermore, the visible neurons are subdivided into two
subgroups – visible input and visible output. In addition to the above two aspects,
namely the ways the networks learn and how the neurons are grouped, there are other
features that characterize Boltzmann machine models. They include: how neurons
are connected among and within the groups – fully or partially; and whether each
connection is bidirectional (i.e., recurrent) or unidirectional.

The subject of the Boltzmann machines is more complicated than other neural
network models we have seen. There are two major reasons for this situation. First,
as discussed above there are many choices for the network characteristics (e.g., the
network architecture and the ways the networks learn) leading to many Boltzmann
machine models. Different models are useful for various applications. Second, the
learning process of a Boltzmann machine requires more steps that involve various
aspects of computation. In the following we will discuss only the most widely used
models, in particular the unsupervised model.

The common computational characteristics of Boltzmann machines and the Hopfield
network are as follows.

1. the weights are symmetric, i.e., wij = wji;
2. no self-feedback, i.e., wii = 0;
3. the state of each neuron is binary (0 or 1) or bipolar (-1 or 1), representing ON or

OFF, respectively;
4. the neurons are picked out randomly one at a time for update.

General problem description

Briefly stated, we are given a set of input patterns (vectors), collectively called the
environment (hereafter, we use the terms “patterns” and “vectors” interchangeably).
These input patterns are clamped (i.e., placed in effect forcefully) one at a time to the
visible neurons in the case of the unsupervised learning model and visible input
neurons in the supervised model. The objective is to match the probability
distribution of the environment (i.e., input patterns) and the probability distribution
of the network. We achieve this task by utilizing the hidden neurons and by training
the network through adjustment of weights. When it is done, the probability
distributions of the environment and the network will be the same. In effect, the input
patterns will be clustered in the network according to a Boltzmann distribution.

In the following we discuss the architecture, problem description, learning process
basics, and learning algorithm.

3.8.2 Unsupervised Learning by the Boltzmann Machine: The Basics

Architecture

Fig. 3.13 depicts a typical Boltzmann machine for unsupervised learning. The


neurons are classified into two groups – visible and hidden. Each neuron assumes
one of two values, +1 or -1. The neurons are fully connected, i.e., every neuron is
connected to every neuron in both groups. Every connection is bidirectional, i.e., the
network is recurrent. The visible neurons are clamped to external states when the
network interacts externally with its environment. The hidden neurons operate
internally to make the network learn the underlying constraints imposed by the
external inputs.

Figure 3.13. A Boltzmann machine architecture example of unsupervised learning.

There are K visible neurons at the bottom layer, whose state can be represented by
(x1, …, xK), and L hidden neurons at the top layer, whose state can be represented by
(xK+1, …, xK+L). Each xi assumes +1 or -1. The neurons are fully connected and each
edge is bidirectional (recurrent).

Let the number of the visible neurons be K and the hidden neurons be L. Then there
are a total of K+L neurons in the network (in Fig. 3.13, K = 4, L = 3, and K+L=7). The
state of these neurons can be represented by the state vector x = (x1, …, xK; xK+1, ...,
xK+L). We can divide x into two groups, visible and hidden, as xα = (x1, …, xK) and xβ
= (xK+1, …, xK+L), respectively. To explicitly indicate vector x contains both visible
and hidden neurons, we also write x as xαβ. A vector x represents a snapshot for a
state of the network. For example, in Fig. 3.13, perhaps x = (1, -1, -1, 1; 1, -1, 1).

Problem description
In words, the basic idea of the problem is as follows. The problem involves two
major components: a Boltzmann machine neural network and its environment (Fig.
3.14). The network has a type of architecture as previously depicted in Fig. 3.13, i.e.,
it consists of a set of visible neurons and a set of hidden neurons. The environment is
a set of input patterns. Only the visible neurons are directly affected by the
environment. The hidden neurons are affected by the environment only indirectly
through the visible neurons.


Fig. 3.14. The problem domain consists of a network and its environment.

The objective of the problem, in abstract terms, is to let the network create a model
of the structure implicit in the set of input vectors by using the hidden neurons. More
specifically, the implicit structure is the probability distribution of the set of input
vectors. We want this probability distribution to be closely (or exactly) realized by
the visible neurons when the network is running free from the environment. For
example, if we have a set of three input vectors, (1, -1, -1, -1), (1, -1, -1, -1), (-1, -1,
-1, 1) in the environment, the probability distribution will be 2/3 for (1, -1, -1, -1) and
1/3 for (-1, -1, -1, 1). When the model created by the network is perfect, the visible
neurons will have exactly the same probability distribution.

More generally, we are given a set of Q training input patterns: S = {xv1, ..., xvQ},
where these patterns are not necessarily distinct. Patterns can be repeated in
proportion to how often they are known to occur. Since each pattern has K
components (x1 through xK), the entire set S of Q training patterns consists of QK
components of 1s and -1s. Our task is to perform unsupervised learning on the
network so that the visible neurons and the input patterns are clustered in terms of
their probability distributions. Clustering means dividing the patterns into groups,
where similar patterns are placed in the same group while all the others are in
different groups. Clustering is performed by adjusting network weights wij as we will
see.

Simple example

K = 4, L = 2, i.e., the architecture is obtained by removing x7 in Fig. 3.13. A state of
the 6 neurons, where each neuron takes +1 or -1, can be represented by the state
vector x =(x1, …, x4; x5, x6). We can divide x into two groups, xα = (x1,…, x4) and xβ
= (x5, x6). We assume Q = 3 training input patterns: (1,-1,-1,-1), (1,-1,-1,-1),
(-1,-1,-1,1). The first two patterns are repeated. Set S of these input patterns is given
by S = {xv1, xv2, xv3} = {(1,-1,-1,-1), (1,-1,-1, -1), (-1,-1,-1,1)}, which consists of QK
= 3 × 4 = 12 values of 1 or -1.

The visible neurons can be interpreted as representing colors: x1 = red, x2 = yellow,
x3 = green and x4 = blue. (1,1,-1,-1) would represent orange. The above three input
patterns represent red, red, blue. When these three input patterns are presented and
the network successfully converges, the input patterns will be clustered into two
categories: red and blue.

Clustering


One immediate question may be what such a clustering action accomplishes. The
answer is basically the same as what a Kohonen network does (section 3.6). The
network stores the input patterns through modification of weights. The weights
become densest where input patterns are most common.

Suppose we observe spectral intensities of 100 stars and feed them to a network.
Perhaps 20 of them are classified as reddish, 30 are bluish, and 50 are whitish. The
network clusters the 100 stars into these three categories without being told by the
human how many categories are expected and what these categories are. Clustering
data can be applied in many domains. Medical records such as laboratory test results
can be analyzed. Data for patients with a certain disease will be clustered in a
different group from healthy people. Such information will help to diagnose new
patients with this disease. Clustering machines based on observed characteristics
may identify those that are about to fail. Clustering business firms based on their
records and recent activities may reveal those that are in trouble.

Learning process

For their learning rules, there are some similarities among the Boltzmann machine
and other neural network models, such as backpropagation. Initially the weights are
randomly assigned, and the networks learn by adjusting the weights. The adjustments
are performed over iterations, by slightly changing the weights each time by: wij(new)
= wij(current) + Δwij. In the case of the backpropagation model, Δwij is selected to
decrease the total error of the network in the steepest direction. Similarly, in the case
of the Boltzmann machine, Δwij is selected to decrease the relative entropy of the
network in the steepest direction. The relative entropy is a function of probabilities
of the network states, and it is analogous to the one in statistical mechanics. Using
these probabilities, it can be shown that the steepest direction is represented by the
mean correlations between neurons. Hence, during the course of iterations these
mean correlations between neurons are collected, Δwij is determined, and the weights
are adjusted. This is the essence of the learning process.

Network energy

Analogous to thermodynamics, we can define the energy of the Boltzmann machine
of a particular state as:

E = - (1/2) Σi Σj wij xi xj.    (1)

In the above and the other expressions in this section, i ≠ j is assumed for double

summation in terms of i and j, since wii = 0. We note that the above expression can
also be written as - Σi Σj>i wij xi xj, by summing up for only the upper triangle for j.

Suppose that we pick out a neuron xk, and change xk to –xk (i.e., flip +1 and -1). We
want to determine ∆Ek, the change in the energy of the entire network due to such a

flip. Let E' be the energy after the flip. We can consider a two-dimensional matrix

whose elements are wij xi xj. The only elements affected by the flip are those in the kth
row and kth column, and their summations are: - (1/2) (- xk) Σj wkj xj - (1/2)(- xk) Σi
wik xi = -(- xk) Σi wik xi. In the last step we used wkj = wjk and replaced dummy index j

with i. The counterpart term in E before the flip can be obtained by simply replacing


(- xk) with (xk), i.e., the term is -(xk) Σi wik xi. Hence,

∆Ek = E' - E = 2 xk Σi wik xi. (2)

(Some authors define ∆Ek = E - E' = -2 xk Σi wik xi. In a binary system where each
neuron xk assumes 0 or 1, we can determine the energy difference when xk = 1 is
changed to 0: ∆Ek = (E for xk = 0) - (E for xk = 1) = Σi wik xi.)
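
Equations (1) and (2) can be checked numerically with a short sketch such as the following; the small weight matrix and state vector are made-up values for illustration only.

__________________________________________________________________

import numpy as np

def network_energy(W, x):
    """Equation (1): E = -(1/2) * sum_{i != j} w_ij x_i x_j (w_ii = 0 assumed)."""
    return -0.5 * x @ W @ x

def delta_E(W, x, k):
    """Equation (2): energy change when neuron k is flipped, dE_k = 2 x_k sum_i w_ik x_i."""
    return 2 * x[k] * (W[:, k] @ x)

# Tiny made-up example with symmetric weights and zero diagonal
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.2],
              [-0.3, 0.2, 0.0]])
x = np.array([1, -1, 1])

x_flipped = x.copy()
x_flipped[0] = -x_flipped[0]
# The direct difference E' - E should equal delta_E(W, x, 0)
print(network_energy(W, x_flipped) - network_energy(W, x), delta_E(W, x, 0))

__________________________________________________________________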

State probabilities of individual neurons

During the course of the learning process, neurons are selected at random and
updated according to the following probabilities. Change xi to –xi with the
probability:

P(xi → -xi) = 1 / (1 + exp(∆Ei/T)) = 1 / (1 + exp(2xi Σj wij xj / T))    (3)

where ∆Ei is the change in the energy of the entire network due to such a flip given in
equation (2) and T is the current temperature for simulated annealing. xi stays the
same with the probability:

P(xi → xi) = 1 - P(xi → -xi) = 1 / (1 + exp(-∆Ei/T)) = 1 / (1 + exp(-2xi Σj wij xj / T))    (4)

We note that the right-hand sides of equations (3) and (4) add up to 1. In equation (3),
when ∆Ei = 0, the probability is equal to 0.5. When ∆Ei > 0, the probability is < 0.5;
the probability approaches 0 as ∆Ei increases (i.e.: xi does not flip). When ∆Ei < 0, the
probability is > 0.5; it approaches 1 as ∆Ei decreases (i.e.: xi flips). These make sense

in order for the network energy to decrease. The effect of temperature T is that when
it is large, | ∆Ei/T | is small, so it does not produce as much of an impact on the
probability of whether it will flip or not flip. When T gets small, | ∆Ei/T | becomes

large, and so has a greater impact on determining whether it will flip or not flip.

These features agree with the spirit of simulated annealing.

Equations (3) and (4) are consistent with the probabilities of the state of a neuron,
either +1 or -1, which is determined in a stochastic way as follows:

xi = +1 with probability pi, and xi = -1 with probability 1 - pi    (5)

where

pi = 1 / (1 + exp(-2 Σj wij xj / T)),    1 - pi = 1 / (1 + exp(2 Σj wij xj / T)).    (6)

Again we note that the right-hand sides of the above expressions add up to 1. The

consistency among equations (3) through (6) can be checked as follows. Regardless

of the current xi, i.e., xi = 1 or -1, we see that the probability of having a new xi = 1 is

equal to pi by applying equation (4) with xi = 1 or (3) with xi = -1. Similarly, we see

that the probability of having a new xi = -1 is equal to 1 - pi, regardless of the current

xi. Alternatively, we can compute P(xi = 1 after update) = P(xi = 1 before update) ×
P(xi → xi) + P(xi = -1 before update) × P(xi → -xi) and P(xi = -1 after update) = P(xi
= -1 before update) × P(xi → xi) + P(xi = 1 before update) × P(xi → -xi).
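
A minimal sketch of this stochastic update rule, under the assumption of a weight matrix stored as a nested list, is shown below; the function name and the overflow guard are illustrative additions, not part of the original formulation.

__________________________________________________________________

import math
import random

def update_neuron(W, x, i, T):
    """Stochastically set neuron i to +1 or -1 following equations (5) and (6).

    p_i = 1 / (1 + exp(-2 * sum_j w_ij x_j / T)) is the probability of x_i = +1;
    this is equivalent to flipping x_i with probability 1 / (1 + exp(dE_i / T)),
    as in equations (3) and (4).
    """
    net = sum(W[i][j] * x[j] for j in range(len(x)) if j != i)
    arg = -2.0 * net / T
    p_plus = 0.0 if arg > 700 else 1.0 / (1.0 + math.exp(arg))  # guard against overflow
    x[i] = 1 if random.random() < p_plus else -1
    return x

__________________________________________________________________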

Free-running (-) and clamped (+) phases

The learning process of the Boltzmann machine includes the following two major
phases:
• Free-running phase (also called negative phase, and the “-“ sign is associated
with this phase), where the network operates freely without the influence of the input
patterns.
• Clamped phase (also called positive phase, and the “+” sign is associated with
this phase), where the input patterns are clamped to network’s visible neurons.
These two phases are performed alternately as -, +, -, +, ….

We label the states of the visible neurons with α, and those of hidden neurons with
β. With K visible neurons, α runs from 1 to 2^K as (1, ..., 1, 1), (1, ..., 1, -1), ..., (-1, ...,
-1, -1). Similarly, with L hidden neurons, β runs from 1 to 2^L. A state of the whole
system of K + L visible and hidden neurons is specified by an α and a β as one of 2^(K+L)
possible states; we denote this specific state as αβ, a concatenation of α and β. Note
that although the two letters α and β are used, αβ represents a single state of the whole
system. Let vector xαβ represent a particular state of the network involving all the
visible and hidden neurons. For example, in Fig. 3.13, xαβ = (1, -1, -1, 1; 1, -1, 1) can
be one of the 2^7 = 128 possible different states.

We place superscript - for a free-running phase and + for a clamped phase. P(xα)-
represents the actual probability of finding the visible neurons in state α at
equilibrium in the free-running (-) phase. Hereafter we use simpler notation and write
P(xα)- as Pα-. Pα+ is the desired probability of finding the visible neurons in state α
at equilibrium in the clamped (+) phase. Similarly, Pαβ- and Pαβ+ represent the
probabilities that the entire network is in state αβ at equilibrium in the free-running
and clamped phases, respectively. Pβ|α- is the conditional probability at equilibrium
that the hidden neurons are in state β, given that the visible neurons are in state α, in
a free-running phase. Pβ|α+ is the same except that it is in a clamped phase.

Computing ∆wij based on the relative entropy

Based on information theory, we define the relative entropy G as a measure of the
difference between the distributions Pα- and Pα+, weighted by Pα+, as:


G = Σα Pα+ ln(Pα+ / Pα-).    (7)

G is always positive or zero, and is zero when Pα- = Pα+ for all α.

We select Δwij to decrease the relative entropy G in the direction of the steepest
gradient:

∆wij = -η ∂G/∂wij.    (8)

It can be shown that the above equation leads to (see Appendix):

∆wij = (η/T) [<xixj>+ - <xixj>-]    (9)

where <xixj>+ and <xixj>- are the mean correlations between neurons xi and xj in the

clamped (+) and free-running (-) phases, respectively. The mean correlations take

real values on [-1, 1], and they are determined by taking the averages of xixj (See Step
5 in the following Learning algorithm).

3.8.3 Unsupervised Learning by the Boltzmann Machine: Algorithms

In the following, we summarize two algorithms, a step-by-step procedure for
unsupervised learning for the Boltzmann machine and a procedure for testing on the
same machine (Randall C. O’Reilly, 2006, private communication). There are
variations of these algorithms, and which one is best in terms of performance and
computing time depends on the application. We also give two illustrative examples.

Learning (training) algorithm
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
The algorithm consists of sextupled nested loops.

Step 0. Initialization:

Set weights wij to random values uniformly distributed over [-u, u], where u is
typically 0.5 or 1. Set the hidden neurons randomly to 1 or -1 with equal probability.

Step 1. Iterations over consecutive network convergences.

Repeat the rest of the algorithm until the network satisfies a convergence criterion
(e.g., equation (10) below) for a certain number (e.g., 5) of consecutive times. (See
Note below)

Step 2. Iterations over one-time network convergence.

Repeat the rest of the algorithm until the network converges one time. There are a
few different criteria for convergence check, and the following is a common one:

Σk Σi (xi- - xi+)² < ε.    (10)


The outer summation is taken over all input patterns, and the inner summation over
all the neurons in the network, i = 1, ..., K + L. xi- and xi+ are the xi values during the
negative and positive phases, respectively. ε is a preset small positive number. The
number of iterations of Step 2 is typically much higher than the number of distinct
input patterns because of this convergence condition. When the network satisfies
criterion (10), we say this is a “one-time network convergence”; go back to Step 1.

Step 3. Iterations over input patterns.

A complete pass through the set of all input patterns is called an epoch. After an
epoch, go back to the convergence check in Step 2.

For each input pattern, perform sub-steps a, b and c, each once.

a. Positive (clamped) phase. Clamp the input pattern to the visible
neurons and perform Step 4.

b. Negative (free-running) phase. Perform Step 4 without clamping the
input pattern to the visible neurons.

c. Updating weights wij's.

wij(new) = wij(current) + Δwij, where
Δwij = (η / Tf) (<xixj>+ - <xixj>-),

where η is a positive constant and Tf is the final temperature in Steps 4a
and 4b; <xixj> is the average of xixj collected during Step 4b; <xixj>+ is the
average for the positive phase and <xixj>- is for the negative phase. (See
Note below)

Step 4. Simulated annealing - iterations over temperatures.

a. For each of the clamped and free-running phases invoked in Step 3, perform
simulated annealing, starting from a high temperature T0 and gradually decreasing
the temperature T. For each temperature, perform Step 5. When T reaches a small
positive number Tf, the system is in thermal equilibrium at this temperature. Go to
Step 4b.

b. For each of the clamped and free-running phases invoked in Step 3, perform
additional iterations (say, 10 times) of Step 5, using the final temperature Tf in Step
4a. During these iterations, collect statistical data of xixj. After the iterations, compute
<xixj>+ and <xixj>- necessary in Step 3c. The number of additional iterations affects
the probabilistic accuracy of the collected data; the higher the number the better the
accuracy.

Step 5. Iterations at each temperature T given in Step 4.


Repeat Step 6 - updating neurons xi's, until all | ∆Ei | given in Step 6 become small
enough at the temperature T. (Here “all” refers to all the neurons xi under Items 1) or
2) in Step 6, below)

Step 6. Updating neurons, xi's.

Perform the inner-most iterations over
1) all the visible and hidden neurons for a negative phase invoked in Step 3b,
or
2) all the hidden neurons for a positive phase invoked in Step 3a.

Randomly pick out a neuron xi and change xi to –xi (i.e., flip +1 and -1) with
the probability:

P(xi → -xi) = 1 / (1 + exp(∆Ei/T)) = 1 / (1 + exp(2xi Σj wij xj / T))

where ∆Ei is the change in the energy of the entire network due to such a flip
and T is the current temperature set in Step 4. xi stays the same with the
probability 1 - P(xi → -xi) = 1 / (1 + exp(-2xi Σj wij xj / T)).

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

Note on Step 1.

The reason is that the network often “accidentally” gives a very small value or zero
for the criterion – a false convergence indication even though it has not actually
converged. When iterations are continued further in such a case, the criterion value
returns to a larger value, indicating that the network had not converged. There is no
scientific formula to determine the appropriate number of consecutive times in this
Step. For a given problem, we can experiment by selecting a relatively large number
and observing that the criterion value does not change in Step 1. The number can be
affected by several factors such as the size of the network, the number of input
patterns, and the degree of complexity of each pattern. A rule of thumb is that the
more complex the environment (e.g., a large network), the fewer consecutive
convergences are required, because an accidental false convergence is less likely.
(Another convergence criterion can be changes of the network weights. If none of
the weights change for a certain number (e.g., 5) of consecutive times, the network
is considered converged.)

Note on Step 3c.

Alternatively, Step 3c of updating weights can be performed after each epoch, rather
than after each input pattern. Such a scheme will require significantly less computing
time when there are many patterns. However, the overall performance can be less
satisfactory for some problems. When weights are modified only after all patterns are


processed, delicate weight adjustments specific to each pattern may not be
accomplished. This modified version can be implemented by setting up a new step,
Step 3.5, between Steps 3 and 4. Steps 3a and 3b are moved to Step 3.5. New Step 3
performs iterations over patterns, invoking Step 3.5 for every pattern. After all
patterns are invoked, i.e., after an epoch, perform Step 3c once.
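
To make the structure of the nested loops easier to see, the following is a highly condensed Python sketch of the learning algorithm (Steps 0 through 6). It is an approximation under several assumptions: fixed iteration counts replace the equilibrium and convergence tests of Steps 1, 2 and 5, weights are updated after every pattern (not after each epoch), and all helper names are illustrative rather than taken from the text.

__________________________________________________________________

import math
import random

def train_boltzmann(patterns, K, L, eta=0.001, T0=3.0, Tf=0.01, alpha=0.9,
                    epochs=50, stats_iters=10):
    """Condensed sketch of unsupervised learning on a Boltzmann machine."""
    n = K + L
    # Step 0: random symmetric weights with zero diagonal
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = random.uniform(-0.5, 0.5)

    def stochastic_update(x, i, T):
        # Step 6: equations (3)-(6), set x_i to +1 with probability p_i
        net = sum(W[i][j] * x[j] for j in range(n) if j != i)
        arg = -2.0 * net / T
        p_plus = 0.0 if arg > 700 else 1.0 / (1.0 + math.exp(arg))
        x[i] = 1 if random.random() < p_plus else -1

    def anneal_and_collect(x, free_indices):
        # Steps 4a, 5, 6: anneal from T0 down to Tf, updating only unclamped neurons
        T = T0
        while T > Tf:
            for _ in range(len(free_indices)):      # Step 5, fixed count for brevity
                stochastic_update(x, random.choice(free_indices), T)
            T *= alpha
        # Step 4b: collect <x_i x_j> statistics at the final temperature Tf
        corr = [[0.0] * n for _ in range(n)]
        for _ in range(stats_iters):
            stochastic_update(x, random.choice(free_indices), Tf)
            for a in range(n):
                for b in range(n):
                    corr[a][b] += x[a] * x[b] / stats_iters
        return corr

    for _ in range(epochs):                         # Steps 1-2, fixed count for brevity
        for pattern in patterns:                    # Step 3: loop over input patterns
            # Step 3a: positive (clamped) phase - only hidden neurons are free
            x = list(pattern) + [random.choice([1, -1]) for _ in range(L)]
            corr_plus = anneal_and_collect(x, list(range(K, n)))
            # Step 3b: negative (free-running) phase - all neurons are free
            x = [random.choice([1, -1]) for _ in range(n)]
            corr_minus = anneal_and_collect(x, list(range(n)))
            # Step 3c: delta w_ij = (eta / Tf) * (<x_i x_j>+ - <x_i x_j>-)
            for i in range(n):
                for j in range(n):
                    if i != j:
                        W[i][j] += (eta / Tf) * (corr_plus[i][j] - corr_minus[i][j])
    return W

__________________________________________________________________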

After a network converges for a set of training (learning or exemplar) patterns
employing the above Learning algorithm, new test patterns can be associated to the
network. The basic idea is the same as associative memory discussed in Section 3.2.
Test patterns can be generated in various ways: 1. noisy patterns, i.e., randomly
flipping +1 and -1 for some (e.g., 25% of) pixels of one of the training patterns; 2.
completely random patterns, i.e., picking all pixels randomly; 3. patterns obtained
from experimental data. Most of these test patterns should converge to their closest
corresponding training patterns. However, some test patterns may converge to the
wrong training patterns, or to patterns different from any of the training patterns.
Such errors can occur depending on various factors such as the probability
distribution of the training patterns, the quality of the test patterns and the problem
complexity. The following is a possible algorithm to perform such testing.

Testing algorithm for each test pattern

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

Step 0. Initialization: Keep the weights wij obtained from the training session. Set
the visible neurons to the testing pattern and assign values of the hidden neurons
randomly.

Step 1. After the above, perform sub-step b (Negative, free-running phase) of Step 3
in the Training Algorithm. This invokes nested iterations over Steps 4a, 5, and 6. Note
that in the Testing algorithm, sub-step a (Positive phase) and sub-step c (Updating
weights) are not performed.
⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

Illustrative examples

My graduate students Bob Crichton and Scott Galinac performed numerical
experiments on a simple Boltzmann machine, similar to Fig. 3.13. The network has
12 visible neurons and 6 hidden neurons. Three different input patterns are prepared
as shown in Fig. 3.15. One pattern of (a), two patterns of (b) and three patterns of (c)
are given as a set of training input patterns (i.e., the probability distribution of the
input patterns is 1/6, 2/6 and 3/6, respectively). For this specific run, we set T0 = 1 and
(new T) = 0.90 × (current T) in Step 4 of the algorithm, to a minimum temperature of
0.05. A typical run required about 5 consecutive iterations of Step 1 for the network
to converge with ε = 0 in Equation (10). After a successful run for network training,
different test patterns are given to the network. Most of these test patterns converged
to the corresponding training patterns.


(a) Blue ●●●●○○○○○○○○ (b) Yellow ○○○○●●●●○○○○ (c) Red ○○○○○○○○●●●●

Figure 3.15. Simple Boltzmann machine experiment, with 12 visible neurons and 6 hidden
neurons similar to Fig. 3.13. Three types of training input patterns (a), (b) and (c) are given to
the network with frequency of 1, 2 and 3, respectively (i.e., a total of six patterns). Here ●
represents +1, ○ represents -1.

Bob and Scott also experimented with a larger Boltzmann machine network with
120 visible neurons and 30 hidden neurons. Three different training input patterns
are prepared as shown in Fig. 3.16. As before, one pattern of (a), two patterns of (b)
and three patterns of (c) are given as a set of training input patterns. A typical run
required only a single iteration of Step 1 for the network to converge with ε = 0. In
other words, this second example with 150 neurons did not require consecutive
convergence checks; a one-time network convergence in Step 1 was sufficient. For this
specific run, we set T0 = 3 and (new T) = 0.9 × (current T) in Step 4 of the algorithm,
to a minimum temperature of 0.01; η in Step 3c was set to a small number, e.g.,
0.0001 or 0.001.

(a) (b) (c)

Figure 3.16. Three types of training input patterns for a Boltzmann machine with 120 visible
neurons and 30 hidden neurons. The input patterns (a), (b) and (c) are given to the network
with frequency of 1, 2 and 3, respectively (i.e., a total of six patterns). A black pixel represents
+1, a white pixel represents -1.

After a successful run for network training, different test patterns are given to the
network. They are: 1. noisy patterns (training patterns with 25% of the pixels
reversed) and 2. completely random patterns. Most noisy patterns converged
correctly to their corresponding training patterns. In one test with random input patterns,
100 patterns were generated randomly and given to the network. The probability
distribution of the convergence of the random patterns was close to that of the training
input patterns, i.e., approximately 1/6, 2/6 and 3/6 for patterns (a), (b) and (c),
respectively. An exception arises when the initial temperature is set very high (for
example, T0 = 100 rather than 3 in the Fig. 3.16 example): the distributions then tend
to skew toward the training pattern(s) with the highest probability. For example, out
of 100 completely random patterns, 75 of them may converge to Pattern (c), 5 to
Pattern (b), none to Pattern (a), and the remaining to patterns different from any of
the training patterns. This
indicates that a selection of reasonable parameter values such as the initial


temperature and the minimum temperature is important, which may be determined
through experiment. Fig. 3.17 illustrates a typical behavior of the network when it is
given a noisy test pattern. For this specific run, we set T0 = 3 and (new T) = 0.90 ×
(current T) in Step 4 of the algorithm, to a minimum temperature of 0.01.

Figure 3.17. A typical behavior of the network described in Fig. 3.16 when it is given a noisy
test pattern.

3.8.4 Appendix. Derivation of Delta-Weights

(Hinton and Sejnowski, 1986, pp. 315-316)

To derive equation (9), ∆wij = (η/T) [<xixj>+ - <xixj>-], we substitute G
given by equation (7) into equation (8), noting that Pα+ is independent of wij since it
is the probability of the visible neurons in the clamped phase:

∆wij = -η ∂G/∂wij = η Σα (Pα+ / Pα-) ∂Pα-/∂wij.    (11)

According to the Boltzmann distribution from statistical mechanics, the
probability Pα- is given by:

Pα- = Σβ Pαβ- = Σβ exp(-Eαβ/T) / Σλμ exp(-Eλμ/T).    (12)

In the rightmost expression, Eαβ is the energy of the network in state αβ, and is given
by:

Eαβ = - Σi Σj>i wij xi^αβ xj^αβ    (13)

where xi^αβ is the ith neuron in state αβ. We note that αβ is subscripted in the
previous expressions such as Pαβ- and Eαβ, but in xi^αβ it is superscripted since we need
to indicate two indices, one for the entire state αβ and the other for each individual
neuron i. Equation (13) is the same as equation (1), except that state αβ is explicitly
shown and the summation is taken over the upper triangle area so that there is no 1/2


factor. The summation of the denominator of the rightmost expression in equation

(12) is carried over all the states involving both visible and hidden neurons. The
dummy index λμ is used in place of αβ to distinguish it from α and β. This

denominator is a normalization factor called the partition function; we note that with
this factor, Σα Pα- = 1.

By substituting equation (13) into equation (12), Pα- is expressed in terms of wij

and is directly differentiable with respect to wij. The derivative will be used for the

last factor of equation (11). To carry out the differentiation, we first evaluate

∂/∂wij {exp(-Eαβ/T)} = exp(-Eαβ/T) (-1/T) ∂/∂wij [- Σi Σj>i wij xi^αβ xj^αβ] = (1/T) xi^αβ xj^αβ exp(-Eαβ/T)    (14)

In the above, from the first expression to the second, we substituted equation (13)

into the second occurrence of Eαβ. From the second expression to the third, we note

that all except the wij term are zero when differentiated with respect to wij. We now

compute ∂Pα-/∂wij from equation (12), by applying the calculus formula
(u/v)' = u'/v - uv'/v², and by using the result of equation (14):

∂Pα-/∂wij = [Σβ (1/T) xi^αβ xj^αβ exp(-Eαβ/T)] / [Σλμ exp(-Eλμ/T)]
            - [Σβ exp(-Eαβ/T)] [Σλμ (1/T) xi^λμ xj^λμ exp(-Eλμ/T)] / [Σλμ exp(-Eλμ/T)]²

          = (1/T) [Σβ Pαβ- xi^αβ xj^αβ - Pα- Σλμ Pλμ- xi^λμ xj^λμ].    (15)

Substituting equation (15) into equation (11) yields:

∆wij = -η ∂G/∂wij = η Σα (Pα+ / Pα-) ∂Pα-/∂wij

     = (η/T) Σα (Pα+ / Pα-) [Σβ Pαβ- xi^αβ xj^αβ - Pα- Σλμ Pλμ- xi^λμ xj^λμ]

     = (η/T) [Σα Σβ (Pα+ / Pα-) Pαβ- xi^αβ xj^αβ - (Σα Pα+)(Σλμ Pλμ- xi^λμ xj^λμ)].    (16)

Based on probability theory we have:

Pαβ+ = Pβ|α+ Pα+ and Pαβ- = Pβ|α- Pα-.    (17)

Also, at equilibrium,

Pβ|α- = Pβ|α+.    (18)


Since these are conditional probabilities that the hidden neurons are in state β given
that the visible neurons are in state α, they must be the same whether the visible
neurons are clamped or not. Using equations (17) and (18) we have:

(Pα+ / Pα-) Pαβ- = Pαβ+.    (19)

Based on probability theory we also have:

Σα Pα+ = 1.    (20)

Substituting equations (19) and (20) into equation (16) we have:

∆wij = (η/T) [Σαβ Pαβ+ xi^αβ xj^αβ - Σλμ Pλμ- xi^λμ xj^λμ].    (21)

In general, multiplying any quantity Φi for each event i by its corresponding
probability Pi and summing up the products for all events gives <Φ>, the expected or
average value of Φi. The above is a special case where Φ = xixj. Multiplying xi and xj

together with their corresponding probability for each state, and summing up the

products over all possible states, gives the mean correlations between neurons xi and

xj that can be denoted as <xixj>. Using this notation, the first term in the brackets is
<xixj>+, the mean correlations in the clamped (+) phase. Similarly, the second term is
<xixj>-, the mean correlations in the free-running (-) phase. Hence,

∆wij = (η/T) [<xixj>+ - <xixj>-].    (22)

Further Reading

In addition to the literature cited at the end of Chapter 2, the following books and
articles discuss specific topics as described below.

The following book presents many optimization problems solved by applying the
Hopfield-Tank model.

Y. Takefuji, Neural Network Parallel Computing, Kluwer Academic, 1992.

The following three are seminal articles of the Hopfield and Hopfield-Tank models.

J.J. Hopfield, "Neural Networks and Physical Systems with Emergent Collective
Computational Abilities," Proceedings of the National Academy of Sciences, Vol. 79,
1982, 2554-2558.

J.J. Hopfield, "Neurons with Graded Response Have Collective Computational
Properties like Those of Two-state Neurons," Proceedings of the National Academy
of Sciences, Vol. 81, 1984, 3088-3092.

J.J. Hopfield and D.W. Tank, "'Neural' Computation of Decisions in Optimization
Problems," Biological Cybernetics, Vol. 52, 1985, 141-152.


The following two discuss the Kohonen models.

T. Kohonen, "The 'Neural' Phonetic Typewriter," Computer, Vol. 21, 3, 1988, 11-22.

T. Kohonen, Self-Organization and Associative Memory, 3rd Ed., Springer-Verlag,
1989.

The following two are references for simulated annealing.

S. Kirkpatrick, C.D. Gelatt, Jr., and M.P. Vecchi, "Optimization by simulated
annealing," Science, vol. 220, 1983, pp. 671-680. Reprinted in J.A. Anderson and E.
Rosenfeld, Eds., Neurocomputing: Foundations of Research, MIT Press, 1988.

S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images," IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 1984, pp. 721-741.

The following three are seminal articles on Boltzmann machines.

G. E. Hinton and T. J. Sejnowski, "Optimal Perceptual Inference," Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, Washington, DC, New York: IEEE, 1983, pp. 448-453.

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A Learning Algorithm for
Boltzmann Machines," Cognitive Science, vol. 9, 1985, pp. 147-169. Reprinted in
J.A. Anderson and E. Rosenfeld, Eds., Neurocomputing: Foundations of Research,
MIT Press, 1988.

G. E. Hinton and T. J. Sejnowski, "Learning and Relearning in Boltzmann
Machines," in D.E. Rumelhart, J.L. McClelland and the PDP Research Group (Eds.),
Parallel Distributed Processing, Vol. 1, MIT Press, 1986, pp. 282-317.

The following two cited in Chapter 2 provide tutorials for Boltzmann machines and
other neural network models.

J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation,
Addison-Wesley, 1991.

S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice-Hall,
1999.

4 Genetic Algorithms and
Evolutionary Computing

4.1 What are Genetic Algorithms and Evolutionary
Computing?

During the four billion year history of the earth, biological life was born, perhaps
as a result of a series of rare chance chemical and physical reactions of molecules.
Over time, more and more complex forms of biological life evolved. Genetic
algorithms are computer models based on genetics and evolution in biology. The
basic elements of a genetic algorithm are: selection of solutions based on their
goodness, reproduction for crossover of genes, and mutation for random change of
genes. Through these processes, genetic algorithms find better and better solutions
to a problem just as species evolve to better adapt to their environments.

Genetic algorithms have been extended in their ways of representing solutions
and performing basic processes. A broader definition of genetic algorithms,
sometimes called evolutionary computing, includes not only generic genetic
algorithms but also classifier systems, genetic programming where each solution is
a computer program, and some aspects of artificial life. Other related areas include
evolvable hardware, evolutionary robotics, ant colony optimization, and swarm
intelligence.

Genetics in real life
Before studying genetic algorithms, we will briefly review genetics in real life,
e.g., in humans. The life of a human body starts at the fertilization of an egg by a
sperm. Before conception, there are 23 chromosomes, numbered No. 1, 2, ..., 23, in
an egg, and similarly, 23 numbered chromosomes in a sperm (a total of 46). In a
diploid organism such as a human, two chromosomes of the same number, one from
the egg and the other from the sperm, make a pair of chromosomes for the child (e.g.,
chromosome No. 1 from the mother and No. 1 from the father make the child's pair
No. 1) (Fig. 4.1).


Fig. 4.1. A pair of chromosomes for a child.

When we closely look at a pair of chromosomes for the child, there are thousands
or even millions of genes on each chromosome. A gene can be thought of as a tiny
point on a chromosome. One gene from the mother and the corresponding gene
from the father make a gene-pair for the child. Each pair (or a certain number of
pairs) of genes contributes to specific characteristics of the child, such as the blood
type, color of eyes, etc. Such characteristics are called phenotypes.

As an illustration, let us see how the child's blood type is determined as one of
four possible types, O, A, B, or AB. Each gene for blood type can have a value of
either 0, 1, or 2, called alleles. Possible gene-pair combinations, called genotypes,
and their corresponding phenotypes are:

Genotype Phenotype

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯

00 O

10 or 11 A

20 or 22 B

21 AB

For example, if the gene value from mother is 0 and from father is 1, the child's
genotype will be 10 and the phenotype will be blood type A. Note that the order of
gene values in real life is immaterial (e.g., 10 = 01). Alleles differ for different
kinds of genes (e.g., a specific gene may have either 0 or 1, another gene may have
0, 1, 2, or 3, and so on).

When this child grows to an adult and produces an egg (if female) or a sperm (if
male), one chromosome from each chromosome pair is selected for the egg or
sperm. This is how certain phenotypes are inherited from a person to a child, from
the child to a grandchild, and so on. Most gene values are inherited from mother
and father to the child as they are. Occasionally, however, some gene values
change (e.g., from 0 to 1), perhaps because some unusual physical, chemical, or
biological effects (e.g., a gene is hit by a cosmic ray, etc.). Such a change of a gene
value is called mutation.

In any species in biology, those individuals who better adapt to the


environment have higher probabilities for survival, thus they have higher
probabilities for producing their offspring. Over generations, this process is
repeated, and the result is that those individuals and genes that better adapt to the
environment tend to remain while those that don't tend to disappear, i.e., become
extinct. This theory of a natural screening process is called (Darwinian) evolution.

The basic idea of genetic algorithms

The computer genetic algorithms which we will study are abstract models of
natural genetics and the evolution process discussed above. Genetic algorithms
include concepts such as chromosomes, genes, mating or crossover breeding,
mutation, and evolution. We will not, however, attempt to build computer models
as close as possible to natural genetics. Rather, we will develop useful models that
are easy to implement in computers by borrowing concepts from natural genetics.

The major process of our genetic algorithm is as follows. At the beginning, we
randomly generate solutions or "chromosomes" for the problem. After the initial
random generation of solutions, we perform iterations. Each iteration consists of
several steps - we select good solutions and perform crossover breeding;
occasionally we may have mutations on certain solutions. Through selection of
good solutions during iterations, the computer will develop increasingly better
solutions as in the case of natural evolution. We can apply this approach to many
types of problems such as optimization and machine learning.

4.2 Fundamentals of Genetic Algorithms

Representations of solutions

A genetic algorithm starts with designing a representation of a solution for the
given problem. A solution here means any value that is a candidate for a correct
solution or a final answer; a solution may or may not be the correct solution. For
example, suppose we want to maximize function y = 5 - (x - 3)2. Then x = 1 is a
solution, x = 2.5 is another solution, and x = 3 is the correct solution that
maximizes y.

The representation of each solution for a genetic algorithm is up to us. It
depends on what each solution looks like and what solution form will be
convenient for applying a genetic algorithm. The most common representation of a
solution is a string of characters.

Consider a finite-length string of characters over a fixed alphabet, e.g., {0,1},
{0, 1, 2}, {0, 1, *}, {0, 1, 2, ..., 9}, or {A, B, ..., Z}, etc. We then choose the length
of each string, such as, 12, 64, or 256, depending on the alphabet used and the
amount of information we want to represent in each string. The larger the alphabet
the more information can be represented by each character; therefore, fewer
characters are necessary to encode a specific amount of information. A string is
somewhat analogous to a chromosome or a set of chromosomes. Suppose that we
represent each solution by a 12-bit string over the alphabet {0, 1}. A solution in
this case may represent a set of values of 12 variables or parameters, each bit


representing a binary value of a parameter. Each parameter, i.e., a bit in this case, is
analogous to a gene. Or, the range of each parameter may be larger than binary 0
and 1. A solution of 12 bits may represent values of 3 parameters, each parameter
using 4 bits. In this case, each parameter can range from binary 0000 to 1111, or
decimal 0 to 15; each solution or chromosome has 3 genes.

Given an application problem, we can represent each solution as a fixed-length
string, say, 32 bits. For example, a company is making four kinds of products and
the problem is to find the number of products to make in order to maximize the
profit under certain conditions. Then specific amounts of the products, e.g., (30,
10, 25, 40) for Products 1, 2, 3 and 4, respectively, is a solution, (20, 20, 30, 35) is
another solution, and so on. We can represent a solution by a string, assigning the
first 8 bits to represent the amount for Product 1, the next 8 bits for Product No. 2,
and so on, with the total of 32 bits.

As said before, the representation of each solution for a genetic algorithm is up
to us. Although string representation of a solution is common, other forms of
representation may be more convenient for other problems. For example, for
certain graph problems, a graph can be a solution. A graph can be represented by
an adjacency matrix for certain problems. For a weighted graph problem, each
solution may be a matrix whose elements represent the weights associated with the
edges. For genetic programming problems, each solution is a computer program,
much more structured than a character string. We should be flexible for adopting
the most appropriate form of solution representation for each problem. In this and
following sections, we will use simple string representation on the alphabet, {0, 1}.

The fitness of a solution

The fitness of a solution is a measure that can be used to compare solutions to
determine which is better. For example, a company is trying to maximize a profit.
The profit itself can be used as the fitness, or a scaled value of the profit can be the
fitness. In the following we will briefly describe the fundamental steps of a genetic
algorithm. The meaning of these steps will become clearer when we see examples
in the following sections.
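
As a concrete illustration, for the toy maximization of y = 5 - (x - 3)² mentioned earlier, one possible encoding and fitness function might look like the following sketch; the string length, search range, and population size are arbitrary example choices, not prescriptions from the text.

__________________________________________________________________

import random

BITS = 8                  # chromosome length: an 8-bit string over {0, 1}
X_MIN, X_MAX = 0.0, 6.0   # assumed search range for x (an arbitrary choice)

def decode(chromosome):
    """Map a bit string such as '01101001' to a real value x in [X_MIN, X_MAX]."""
    value = int(chromosome, 2)
    return X_MIN + (X_MAX - X_MIN) * value / (2 ** BITS - 1)

def fitness(chromosome):
    """Fitness of a solution: here simply y = 5 - (x - 3)^2 itself."""
    x = decode(chromosome)
    return 5.0 - (x - 3.0) ** 2

# A randomly generated initial population of 4 solutions
population = [''.join(random.choice('01') for _ in range(BITS)) for _ in range(4)]
for chrom in population:
    print(chrom, round(decode(chrom), 3), round(fitness(chrom), 3))

__________________________________________________________________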

Basic steps of a genetic algorithm

There are variations and extensions of the genetic algorithm procedure. The
following is a simple and typical one. A set of solutions at a specific time step is
called the population.

Step 0. Initialization of the population.

Generate a set of solutions randomly.

Repeat the following three steps until the correct (optimal) solution is found, or
more generally, until a terminal condition is satisfied. For certain problems, we
may not know the correct solution. In such a case, we set up terminal condition(s).
For example, we keep track of the best solution in each iteration. When it does not
improve over a certain number of iterations (e.g., 10), we terminate the iterations.


Step 1. Reproduction

(a) Determine the fitness values and their corresponding
probabilities for all the solutions in the population.

(b) Creation of a mating pool. Randomly select solutions weighted
by the fitness. Solutions with higher fitness are more likely to
be picked out than the unfit ones and tend to survive into the
next generation. Here the evolution concept based on the
principle of natural selection is employed.

Step 2. Crossover (recombination) breeding

(a) Take two solutions randomly at a time. With a fixed crossover
probability pc (e.g., pc = 0.7), randomly determine whether
crossover takes place. If crossover does take place, go to the
next substep (b); otherwise, form two offspring that are exact
copies of the two solutions (parents), and go to Step 3.

(b) Select randomly internal points (crossing sites) of the solutions,
then swap the solution parts that follow these points.

                                   crossing site
                                        ↓
    Before crossover:     Solution 1:   ^ ^ ^ | _______
                          Solution 2:   _ _ _ | ∨ ∨ ∨ ∨

    Next generation       Offspring 1:  ^ ^ ^ ∨ ∨ ∨ ∨
    solutions (offspring): Offspring 2: _ _ _ _______

Perform this Step 2 for all the solutions obtained in Step 1, i.e., until
the new population size reaches the initially set population size,
randomly selecting a pair at a time.

The significance of the crossover operations is as follows. Each
solution may represent a set of values for parameters, or a prescription
for performing a particular task, and so on. Parts of each solution (e.g.,
substrings of a solution string) may contain notions of importance or
relevance. A genetic algorithm exploits this information by
reproducing high quality notions, then crossing over these notions
among high performers.
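
A minimal sketch of this single-point crossover operation on bit strings is shown below; the crossover probability and the example strings are illustrative values, not prescriptions from the text.

__________________________________________________________________

import random

def crossover(parent1, parent2, pc=0.7):
    """Single-point crossover of two equal-length bit strings (Step 2).

    With probability pc, pick a random internal crossing site and swap the
    tails of the two parents; otherwise return exact copies of the parents.
    """
    if random.random() >= pc or len(parent1) < 2:
        return parent1, parent2                      # no crossover: copy parents
    site = random.randint(1, len(parent1) - 1)       # internal crossing site
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2

# Example usage (the crossing site is chosen at random):
print(crossover('1110000', '0001111', pc=1.0))

__________________________________________________________________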

Step 3. Random mutation ( or simply mutation).

With a certain fixed small mutation probability, pm (e.g., pm = 0.001),
randomly select a small portion of the solutions and artificially change
it (e.g., a bit of 1 to 0 or 0 to 1). The frequency of mutation is typically
small (e.g., one mutation per thousand bit transfers).

The idea of Step 3 is again modeled from natural mutation. In this way,

