




Laplace Approximation

Thursday, September 11, 2008

Rice University

STAT 631 / ELEC 639: Graphical Models

Instructor: Dr. Volkan Cevher

Scribe: Ryan Guerra

Reviewers: Beth Bower and Terrance Savitsky

Index Terms: Laplace Approximation, Approximate Inference, Taylor Series, Chi Distribution, Normal Distribution, Probability Density Function

When we left off with the Junction Tree Algorithm and the Max-Sum Algorithm last class, we had crafted “messages” that traverse a tree-structured graphical model in order to calculate marginal and joint distributions. We are now interested in finding p(z|x) for the model shown below.

Figure 1. Graph representing both hidden (clear) and observed (shaded) variables with their conditional dependence indicated by the arrow.

In this case, x is our “observed variable” and z is our “hidden variable,” about which we wish to make some inference. While we would normally wish to make an exact inference about p(z|x), this is often either impossible or computationally intractable.

The next few lectures will focus on deterministic approximations to a pdf and then we will
move on to stochastic approximations. The general hierarchy of approximation techniques
is given here for reference.

Approximate Inference

• Deterministic Approximations

(1) Laplace (local)

(2) Variational Bayes (global)

(3) Expectation Propagation (both)

• Stochastic Approximations


(1) Metropolis-Hastings/Gibbs
(2) Sequential Importance Sampling (SIS)

1. Laplace Approximations to a PDF

1.1. Motivation for Representation. The idea here is that we wish to approximate any
pdf such as the one given below with a nice, simple representation.

Figure 2. An example multi-modal distribution that we want to approximate.

The Laplace approximation is a method for representing a given pdf with a Gaussian N(µ, σ²). This works best for a single-mode¹ distribution, since many common distributions can be roughly represented by a Gaussian.

As an example of what we mean by “represent,” consider that we have some function g(x)
distributed according to the density function p(x) and we wish to get the expected value
E{g(x)} through sampling.

$E\{g(x)\} = \int g(x)\, p(x)\, dx$

We wish to calculate this expected value by sampling discrete values from p(x), and thus obtain an estimate Ê{g(x)} of E{g(x)} that can be calculated as:

$\hat{E}\{g(x)\} = \frac{1}{L}\sum_{i=1}^{L} g(x_i)$

¹A mode is a concentration of probability mass in a pdf. A distribution such as the U-quadratic, whose mass is not concentrated in a single region, would be poorly represented by the Laplace approximation.


Figure 3. The original distribution we want to represent (blue), with its Gaussian approximation (red) obtained by using the Laplace approximation method. Note that we were only able to capture one of the original distribution's modes.

Often we will find that p(x) cannot easily be sampled, and we need an alternative way to draw samples from it. This is the subject of Chapter 11 of Bishop [2], but suffice it to say that we can instead draw samples from another, “nicer” distribution q(x), where q(x) is a known pdf that is nonzero wherever p(x) is nonzero:

$\hat{E}\{g(x)\} = \sum_i g(x_i)\, p(x_i) = \sum_i g(x_i)\, \frac{p(x_i)}{q(x_i)}\, q(x_i)$
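As a concrete illustration of this reweighting idea, here is a minimal numerical sketch (not from the original notes); the particular target p(x), proposal q(x), and test function g(x) are arbitrary choices made for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target density p(x): a Chi distribution with k = 3 (an illustrative choice).
p = stats.chi(df=3)
# Proposal density q(x): a Gaussian that roughly covers the mass of p(x).
q = stats.norm(loc=1.5, scale=1.0)
# Function whose expectation E{g(x)} we want.
g = lambda x: x**2

L = 100_000
x = q.rvs(size=L, random_state=rng)   # draw samples from the "nicer" q(x)
w = p.pdf(x) / q.pdf(x)               # importance weights p(x_i) / q(x_i)
estimate = np.mean(g(x) * w)          # reweighted Monte Carlo average

print(estimate, p.moment(2))          # compare against the exact second moment
```

Drawing from the simple q(x) and weighting each sample by p(x_i)/q(x_i) recovers, approximately, the expectation taken under p(x).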

So far, we haven’t said anything about the choice of q(x) we could use to represent our
pdf, but we’d like to use something simple and computable because as the dimensions of
the problem increase, the required computational memory increases dramatically. This is
why we need approximations.

Thus far we have introduced the motivation behind approximation schemes, in particular the method of Laplace approximation. We will proceed by deriving the Laplace approximation using Taylor series expansions. Then we move to a paper by Tierney and Kadane [3] and describe the use of the Laplace approximation to estimate a posterior mean. We conclude with an example of approximating the Chi distribution with a Normal distribution and demonstrate the quality of the approximation graphically.

1.2. Derivation of the Laplace Approximation. Suppose we wished to approximate $p(x) = \frac{f(x)}{z}$, with $f(x) \geq 0$ and $z$ a normalization constant. Let's look at the Taylor series expansion² of $\ln f(x)$:

²Definition: $f(x) = f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \dots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \dots$


(1) $\ln f(x) = \ln f(x_0) + \underbrace{\left.\frac{\partial \ln f(x)}{\partial x}\right|_{x=x_0} (x - x_0)}_{\text{second term}} + \frac{1}{2}\left.\frac{\partial^2 \ln f(x)}{\partial x^2}\right|_{x=x_0} (x - x_0)^2 + \text{h.o.t.}$

Let’s assume that the higher-order terms are negligible and focus, for now, on the second
term in (1):

(2) $\left.\frac{\partial \ln f(x)}{\partial x}\right|_{x=x_0} (x - x_0) = \frac{1}{f(x)}\,\underbrace{\left.\frac{\partial f(x)}{\partial x}\right|_{x=x_0}}_{(*)}\,(x - x_0)$

We notice that $(*)$ is zero at a local maximum of the pdf. If we find this local maximum and choose to expand our Taylor series around that point $x_{\max}$, we ensure that the second term on the RHS of (1) is always zero. This is done by setting $\frac{\partial f(x)}{\partial x}$ equal to zero and solving for $x_{\max}$, the local maximum of the pdf.

Taking the first three terms of the Taylor series expansion around x0 = xmax, then (1)
becomes:

(3) $\ln f(x) = \ln f(x_{\max}) + \frac{1}{2}\left.\frac{\partial^2 \ln f(x)}{\partial x^2}\right|_{x=x_{\max}} (x - x_{\max})^2$

(4) $e^{\ln f(x)} = \exp\left\{ \ln f(x_{\max}) + \frac{1}{2}\left.\frac{\partial^2 \ln f(x)}{\partial x^2}\right|_{x=x_{\max}} (x - x_{\max})^2 \right\}$

(5) $\int e^{\ln f(x)}\,dx = \underbrace{e^{\ln f(x_{\max})}}_{\text{constant}} \int \exp\left\{ \underbrace{\frac{1}{2}\left.\frac{\partial^2 \ln f(x)}{\partial x^2}\right|_{x=x_{\max}}}_{\text{constant}} (x - x_{\max})^2 \right\} dx$

where we exponentiated (3) to obtain (4) and then integrated both sides to obtain (5). We see that the RHS of (5) contains a collection of constants and a single term inside the exponent that is quadratic in $x$. For the sake of simplicity, let $L(x) = \ln f(x)$. Then, if we let $\sigma^2 = -\frac{1}{L''(x_{\max})}$, we get a result that looks remarkably like a Gaussian!

(6) $\int e^{L(x)}\,dx \approx e^{L(x_{\max})} \int \exp\left\{ -\frac{(x - x_{\max})^2}{2\sigma^2} \right\} dx$

This is the result of the Laplace method for integrals, though it is cited with an additional
term n and with the traditional notation xmax = x∗ in Tierney & Kadane [3] as:


(7) $\int e^{nL(x)}\,dx \approx e^{nL(x^*)} \int \exp\left\{ -\frac{n(x - x^*)^2}{2\sigma^2} \right\} dx = \sqrt{2\pi}\,\sigma\, n^{-1/2}\, e^{nL(x^*)}$
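As a quick numerical sanity check of (7) (not part of the original notes), we can compare the Laplace formula against direct quadrature for a simple, arbitrarily chosen L(x) with a single maximum:

```python
import numpy as np
from scipy import integrate, optimize

# An arbitrary, smooth L(x) with a single maximum (an assumption made for illustration).
L = lambda x: -x**2 / 2 - x**4 / 4
n = 20

# Locate the maximizer x* of L(x).
x_star = optimize.minimize_scalar(lambda x: -L(x)).x
# Second derivative of L at x*, here by a central finite difference.
h = 1e-4
L2 = (L(x_star + h) - 2 * L(x_star) + L(x_star - h)) / h**2
sigma = np.sqrt(-1.0 / L2)

# Laplace approximation from (7) versus numerical quadrature of the exact integral.
laplace = np.sqrt(2 * np.pi) * sigma * n**-0.5 * np.exp(n * L(x_star))
exact, _ = integrate.quad(lambda x: np.exp(n * L(x)), -np.inf, np.inf)
print(laplace, exact)   # the two agree to within a few percent at this n, and improve as n grows
```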

1.3. Application. To elaborate further on what (7) gets us, I’m going to borrow from
Tierney & Kadane’s paper. Given a smooth, positive function g, we wish to approximate
the posterior mean of g(x).

(8) $E_n[g(x)] = E\!\left[g(x) \,\middle|\, Y^{(n)}\right] = \frac{\Pr\!\left[g, Y^{(n)}\right]}{\Pr\!\left[Y^{(n)}\right]} = \frac{\int g(x)\, e^{L(x)}\, \pi(x)\, dx}{\int e^{L(x)}\, \pi(x)\, dx}$

Here $L(x)$ is the log-likelihood of the observed data, $\pi(x)$ is the prior density, and $Y^{(n)}$ is the observed set of data. As we can see, the integrals in (8) have the same form as the one in (7), and so they are easily estimated with the Laplace approximation of a pdf; you can read the paper if you're interested in the details.
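The following sketch shows how (8) could be approximated by applying the Laplace formula (7) separately to the numerator and denominator integrals, in the spirit of Tierney and Kadane [3]. The specific likelihood, prior, and g(x) below are toy assumptions chosen only so that the example runs; they are not from the paper.

```python
import numpy as np
from scipy import optimize

def laplace_integral(log_f):
    """Laplace approximation to the integral of exp(log_f(x)) over the real line
    (the factor n in (7) is absorbed into log_f here)."""
    x_star = optimize.minimize_scalar(lambda x: -log_f(x)).x
    h = 1e-4
    second = (log_f(x_star + h) - 2 * log_f(x_star) + log_f(x_star - h)) / h**2
    sigma = np.sqrt(-1.0 / second)
    return np.sqrt(2 * np.pi) * sigma * np.exp(log_f(x_star))

# Toy choices (assumptions for illustration): Gaussian log-likelihood, Gaussian prior, g(x) = exp(x).
log_lik = lambda x: -0.5 * 10 * (x - 1.2) ** 2   # log-likelihood L(x) of the data
log_prior = lambda x: -0.5 * x ** 2              # log of the prior pi(x)
log_g = lambda x: x                              # log of a smooth, positive g(x) = exp(x)

numerator = laplace_integral(lambda x: log_g(x) + log_lik(x) + log_prior(x))
denominator = laplace_integral(lambda x: log_lik(x) + log_prior(x))
print(numerator / denominator)                   # approximate posterior mean of g(x)
```

Because both integrands in this toy example are Gaussian in x, the ratio happens to be exact; for general models the leading error terms of the two Laplace approximations largely cancel in the ratio, which is the main point of [3].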

We now have a step-by-step process for using the Laplace approximation to approximate a single-mode pdf with a Gaussian (a short code sketch follows the list):

(1) find a local maximum $x_{\max}$ of the given pdf $p(x)$ (equivalently, of its logarithm $f(x) = \ln p(x)$);

(2) calculate the variance $\sigma^2 = -\frac{1}{f''(x_{\max})}$;

(3) approximate the pdf with $p(x) \approx \mathcal{N}(x_{\max}, \sigma^2)$.
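A minimal code sketch of these three steps (an illustration, not part of the original notes; the search interval and the finite-difference step are assumptions made so the example is runnable):

```python
import numpy as np
from scipy import optimize, stats

def laplace_approx(pdf, bounds):
    """Gaussian N(x_max, sigma^2) approximating a single-mode pdf over a search interval."""
    log_f = lambda x: np.log(pdf(x))
    # Step 1: find the local maximum of the pdf (equivalently, of its log) inside `bounds`.
    x_max = optimize.minimize_scalar(lambda x: -log_f(x), bounds=bounds, method="bounded").x
    # Step 2: variance from the curvature of the log-pdf at the maximum (central difference).
    h = 1e-4
    second = (log_f(x_max + h) - 2 * log_f(x_max) + log_f(x_max - h)) / h ** 2
    sigma2 = -1.0 / second
    # Step 3: the approximating Gaussian.
    return stats.norm(loc=x_max, scale=np.sqrt(sigma2))
```

For instance, laplace_approx(stats.chi(df=4).pdf, bounds=(0.01, 10)) returns a Gaussian close to N(√3, 1/2), matching the Chi-distribution example worked out below.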

1.4. Example: Chi distribution. $p(x) = \frac{x^{k-1}\, e^{-x^2/2}}{z}$, where $z = 2^{\frac{k}{2}-1}\,\Gamma\!\left(\frac{k}{2}\right)$ and $x > 0$.

Note that $z$ is a normalization constant that doesn't depend on $x$; most books don't even bother with it, and we'll ignore it here for convenience, since it affects neither the location of the maximum nor the curvature. Remember that we're working with the log of the density here, so $f(x) = \ln p(x)$ (up to the constant $-\ln z$).

$f(x) = \ln p(x) = \ln x^{k-1} + \ln e^{-x^2/2}$

$\frac{\partial}{\partial x} \ln p(x) = \frac{\partial}{\partial x}\left[ \ln x^{k-1} - \frac{x^2}{2} \right] = \frac{1}{x^{k-1}}\,(k-1)\,x^{k-2} - x = \frac{k-1}{x} - x = 0$

$x^* = \sqrt{k-1}$

Now that we’ve found the local maximum x∗, we compute the variance σ2 = − f 1
(x∗)


$f''(x^*) = \left.\frac{\partial^2 f(x)}{\partial x^2}\right|_{x=x^*} = \left.\left( -\frac{k-1}{x^2} - 1 \right)\right|_{x=x^*} = -\frac{k-1}{k-1} - 1 = -2$

$\sigma^2 = \frac{1}{2}$

Now all we need is to create the Normal distribution:

$\hat{p}(x) = \mathcal{N}\!\left( x \,\middle|\, \sqrt{k-1},\ \tfrac{1}{2} \right)$

Figure 4. A plot of four Chi distributions (solid) and the corresponding Normal approximations (dashed).
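As a numerical counterpart to Figure 4 (a sketch, not part of the original notes; the values of k and the evaluation grid are arbitrary), we can compare scipy's normalized Chi pdf against the $\mathcal{N}(\sqrt{k-1}, \tfrac{1}{2})$ approximation derived above:

```python
import numpy as np
from scipy import stats

for k in (2, 4, 8, 16):
    # Laplace approximation derived above: mean sqrt(k - 1), variance 1/2.
    approx = stats.norm(loc=np.sqrt(k - 1), scale=np.sqrt(0.5))
    exact = stats.chi(df=k)                      # normalized Chi pdf (includes the constant z)
    x = np.linspace(0.01, np.sqrt(k) + 4, 400)
    max_gap = np.max(np.abs(exact.pdf(x) - approx.pdf(x)))
    print(f"k={k:2d}  mode={np.sqrt(k - 1):.3f}  max pdf gap={max_gap:.3f}")
```

The printed gap should shrink as k grows, consistent with the Chi distribution becoming more Gaussian for large k.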

1.5. Application: A Medical Inference Problem. In the paper "Laplace's Method Approximations for Probabilistic Inference in Belief Networks with Continuous Variables" [1], the authors introduce a medical inference problem. The figure below shows the Bayesian
graphical model for the problem. There are two different experimental treatments for a
disease. The goal is to estimate the posterior mean of the increase in one year survival
probability.

In previous work, the authors ran a Monte Carlo simulation that sampled from the posterior to determine the posterior mean. In this example, the Laplace approximation of the posterior is compared against the Monte Carlo sampling method. Below is a graph with the Monte Carlo samples and the Laplace approximation superimposed.


Figure 5. The Graphical Model

Figure 6. Laplace Approximation vs. Monte Carlo sampling

The Laplace approximation appears to approximate the posterior well. The authors note that the Monte Carlo method takes 20 times longer computationally than the Laplace approximation, which makes the Laplace approximation well suited to this example.

References

[1] Adriano Azevedo-Filho and Ross D. Shachter. Laplace's method approximations for probabilistic inference in belief networks with continuous variables. Uncertainty in Artificial Intelligence, 1994.

[2] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2007.


[3] Luke Tierney and Joseph Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393), 1986.

