On the Bethe approximation - Columbia University


On the Bethe approximation

Adrian Weller
Department of Statistics at Oxford University
September 12, 2014
Joint work with Tony Jebara

1 / 46

Outline

1. Background on inference in graphical models
   - Examples
   - What are the problems and why are they interesting?
   - Belief propagation (BP) as sum-product message passing
   - Variational perspective on inference

2. Bethe approximation
   - Link to BP
   - Other methods to minimize the Bethe free energy
   - New approach: discretize to obtain an ε-approx global optimum
     - How discretize s.t. ε-approx guaranteed?
     - How search efficiently over the discretized space?
   - If time: understanding the two aspects of the Bethe approximation (entropy and polytope), and new work on clamping...

Questions (anytime!)

2 / 46

Background

Focus on undirected probabilistic graphical models, also called Markov random fields (MRFs).
Compact, powerful way to model dependencies among variables.
Many applications, including:
   - Systems biology (protein folding)
   - Social network analysis (friends, politics, terrorism)
   - Combinatorial problems (counting independent sets)
   - Computer vision (image denoising, depth perception)
   - Error-correcting codes (turbo codes, 3G/4G phones, satellite communication)

3 / 46

Example: image denoising

Inference is combining prior beliefs with observed evidence to form a prediction.

[Figure: noisy image −→ MAP inference −→ denoised image]

4 / 46

Notation

Focus on MRFs which are discrete and finite:
   - n variables V = {X1, . . . , Xn} and (log) potential functions ψc over subsets/factors c of V, c ∈ C ⊆ P(V), which give higher score to sub-configurations with higher compatibility
   - Write x = (x1, . . . , xn) ∈ X for one particular complete configuration, and xc for a configuration of the variables in c
   - ψc maps each setting xc → ψc(xc) ∈ R [lookup table]

p(x) = (1/Z) exp( Σ_{c∈C} ψc(xc) ) = e^{−E(x)} / Z,   where E(x) = − Σ_{c∈C} ψc(xc),

and the partition function Z = Σ_{x∈X} exp( Σ_{c∈C} ψc(xc) ) is the normalizing constant that ensures probabilities sum to 1; E is the energy (negative score), cf. physics.

5 / 46

Inference: 3 key problems

Recall p(x) = (1/Z) exp( Σ_{c∈C} ψc(xc) ).

1. MAP inference: identify a configuration of variables with maximum probability: x* ∈ arg max_{x∈X} Σ_{c∈C} ψc(xc)
2. Marginal inference: compute the probability distribution of a subset of variables xc:
   p(xc) = Σ_{x∈X: Xc=xc} p(x) = Σ_{x∈X: Xc=xc} exp( Σ_{c∈C} ψc(xc) ) / Σ_{x∈X} exp( Σ_{c∈C} ψc(xc) )
3. Evaluate the partition function, Z = Σ_{x∈X} exp( Σ_{c∈C} ψc(xc) )

Great interest to find classes of problems and approaches such that exact or approximate inference is tractable.

6 / 46
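For intuition, here is a minimal brute-force sketch (the model and its values are my own toy example, not from the slides) that computes all three quantities for a tiny binary pairwise MRF by enumerating every configuration:

```python
import itertools
import math

# Toy binary pairwise MRF (hypothetical values, for illustration only):
# variables X1, X2, X3 with singleton and pairwise log-potentials.
theta = {1: 0.5, 2: -0.3, 3: 0.2}        # psi_i(x_i) = theta_i * x_i
W = {(1, 2): 1.0, (2, 3): -0.8}          # psi_ij(x_i, x_j) = W_ij * x_i * x_j

def score(x):
    """Total log-potential sum_c psi_c(x_c) for configuration x (dict i -> {0,1})."""
    s = sum(theta[i] * x[i] for i in theta)
    s += sum(w * x[i] * x[j] for (i, j), w in W.items())
    return s

configs = [dict(zip(theta, bits)) for bits in itertools.product([0, 1], repeat=len(theta))]

# 3. Partition function Z (sum over all 2^n configurations).
Z = sum(math.exp(score(x)) for x in configs)

# 1. MAP inference: configuration with maximum score.
x_map = max(configs, key=score)

# 2. Marginal inference: p(X1 = 1).
p_x1 = sum(math.exp(score(x)) for x in configs if x[1] == 1) / Z

print("Z =", Z, " MAP =", x_map, " p(X1=1) =", p_x1)
```

This is exponential in n, which is exactly why the rest of the talk is about approximations.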

Remark: conditioning on observed variables

Recall p(x) = (1/Z) exp( Σ_{c∈C} ψc(xc) ).

Suppose V is split into observed variables Y = y and unobserved variables XU, so x = (xu, y), xu ∈ Xu.

p(xu | y) = p(xu, y) / p(y) = p(xu, y) / Σ_{xu∈Xu} p(xu, y)

   - This is just a new smaller MRF with modified potentials on the variable set XU
   - New partition function to normalize the new distribution
   - Hence the MRF framework is rich enough to handle conditioning
   - When we discuss MRFs, they might or might not have been based on conditioning on variables

7 / 46
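A sketch of the "new smaller MRF" remark, using the binary pairwise (θ, W) parameterization introduced later in the deck (the numerical values are hypothetical): observing Xk = v simply folds every potential touching Xk into its neighbours' unary potentials, plus a constant absorbed by the new partition function.

```python
# Condition a binary pairwise MRF on an observation X_k = v by absorbing
# every potential that touches X_k into its neighbours' unary potentials.
# theta: {i: theta_i}, W: {(i, j): W_ij} with i < j (hypothetical toy values).

def condition(theta, W, k, v):
    theta = dict(theta)
    W = dict(W)
    const = theta.pop(k) * v                    # constant shift, absorbed by the new Z
    for (i, j) in list(W):
        if k in (i, j):
            other = j if i == k else i
            theta[other] += W.pop((i, j)) * v   # pairwise term becomes unary
    return theta, W, const                      # a smaller MRF over V \ {k}

theta = {1: 0.5, 2: -0.3, 3: 0.2}
W = {(1, 2): 1.0, (2, 3): -0.8}
print(condition(theta, W, k=2, v=1))            # MRF on {X1, X3} with shifted thetas
```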

Belief propagation (BP) for inference

Marginal inference via sum-product message passing:
   - Send messages from variable v ∈ V to factor c ∈ C:
     m_{v→c}(xv) = Π_{c*∈C(v)\{c}} m_{c*→v}(xv)
   - Send messages from factor c to variable v:
     m_{c→v}(xv) = Σ_{xc: [xc]_v = xv} φc(xc) Π_{v*∈V(c)\{v}} m_{v*→c}(xv*)
     where φc(xc) = exp(ψc(xc))
   - For MAP inference, use max-product: switch Σ_{xc} → max_{xc}
   - For acyclic models, converges to exact marginals efficiently (2 passes: collect leaves to root, then distribute)

8 / 46
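A minimal sketch of sum-product message passing in its node-to-node form for a binary pairwise MRF (a special case of the factor-graph recursion above; the model values are toy assumptions, not from the slides). On a tree, as in this chain, the marginals it returns are exact; iterating the same updates on a graph with cycles is the loopy BP discussed on the next slide.

```python
import numpy as np

# Sum-product BP for a binary pairwise MRF, with node-to-node messages.
theta = {0: 0.5, 1: -0.3, 2: 0.2}                 # unary log-potentials theta_i * x_i
W = {(0, 1): 1.0, (1, 2): -0.8}                   # pairwise log-potentials W_ij * x_i * x_j
n = len(theta)
edges = list(W) + [(j, i) for (i, j) in W]        # directed copies, one per message
nbrs = {i: [j for (a, j) in edges if a == i] for i in range(n)}

phi = {i: np.exp(np.array([0.0, theta[i]])) for i in range(n)}      # exp(theta_i * x_i)
psi = {e: np.exp(np.array([[0.0, 0.0], [0.0, W[tuple(sorted(e))]]])) for e in edges}

msg = {e: np.ones(2) for e in edges}              # m_{i->j}(x_j), initialised uniform

for _ in range(50):                               # on a tree/chain this converges exactly
    new = {}
    for (i, j) in edges:
        prod = phi[i].copy()
        for k in nbrs[i]:
            if k != j:
                prod = prod * msg[(k, i)]
        m = psi[(i, j)].T @ prod                  # sum over x_i of psi(x_i, x_j) * incoming
        new[(i, j)] = m / m.sum()                 # normalise for numerical stability
    msg = new

for i in range(n):
    b = phi[i].copy()
    for k in nbrs[i]:
        b = b * msg[(k, i)]
    print(f"approx marginal p(X{i}=1) = {b[1] / b.sum():.4f}")
```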

What about cyclic (loopy) models?

   - Can triangulate and run junction tree
     - Exact solution, but takes time exponential in treewidth
   - Or... just run loopy belief propagation (LBP) and hope
     - Often produces strikingly good results
     - But may not converge at all
   - Extensive literature on trying to understand LBP

9 / 46

Inference: a variational perspective

Recall p(x) = (1/Z) exp( Σ_{c∈C} ψc(xc) ) = e^{−E(x)} / Z.

KL-divergence between some distribution q(x) and p(x) given by
D(q||p) = Σ_x q(x) log [q(x)/p(x)] ≥ 0, with equality iff q = p.

Have
0 ≤ D(q||p) = Σ_x q(x) log q(x) − Σ_x q(x) log p(x)
            = −S(q) − Σ_x q(x) [−E(x) − log Z]
            = Eq(E(x)) − S(q) + log Z

where S(q) is the standard Shannon entropy of q.

10 / 46

Inference: a variational perspective

0 ≤ D(q||p) = Eq(E(x)) − S(q) + log Z, equality iff q = p

   - Hence Eq(E(x)) − S(q) ≥ −log Z
   - This function of the distribution q is called the (Gibbs) free energy, FG(q) = Eq(E(x)) − S(q)
   - Minimizing it over all valid distributions q yields −log Z
   - And the arg min is exactly when q = p, the true distribution
   - Hence can think of inference as optimization
   - But still intractable in general...

END OF PART I

11 / 46
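A quick numerical sanity check of the identity above (the toy scores are arbitrary assumptions): for a random q over a small state space, D(q||p) equals FG(q) + log Z, and FG attains −log Z at q = p.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalised model over 8 states: scores s(x), so E(x) = -s(x), p ∝ exp(s).
s = rng.normal(size=8)
E = -s
Z = np.exp(s).sum()
p = np.exp(s) / Z

q = rng.random(8); q /= q.sum()                     # an arbitrary distribution q

kl = np.sum(q * np.log(q / p))                      # D(q||p)
gibbs = np.sum(q * E) - (-np.sum(q * np.log(q)))    # F_G(q) = E_q(E) - S(q)

print(np.isclose(kl, gibbs + np.log(Z)))            # D(q||p) = F_G(q) + log Z
print(np.isclose(np.sum(p * E) + np.sum(p * np.log(p)), -np.log(Z)))  # F_G(p) = -log Z
```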

Part II: Bethe approximation

   - Seek to approximate the partition function Z
   - Also interested in approximate marginal inference (medical diagnosis, power network)

The Bethe approximation: what and why?
   - Introduced by Hans Bethe in the 1930s to study phase transitions in statistical physics. Wikipedia: Bethe left Germany in 1933, moving to England after receiving an offer as lecturer. ... He moved in with his friend Rudolf Peierls... This meant that Bethe had someone to speak to in German, and did not have to eat English food.
   - Found fresh application in machine learning
   - Direct connections to variational inference and belief propagation [YFW01]

12 / 46

Recall variational approach

−log Z = min_{q∈M} FG(q) = min_{q∈M} Eq(E) − S(q(x))

   - M is the marginal polytope, which comprises all globally valid probability distributions over all the variables, i.e. the convex hull of all 2^n configurations (for binary variables)
   - FG is the Gibbs free energy, with optimum at the true distribution

Bethe approximation has 2 aspects, both pairwise approximations:
   1. Relax the marginal polytope M to the local polytope L, which enforces only pairwise consistency, hence pseudo-marginals
   2. Use the Bethe entropy SB = Σ_{i∈V} Si + Σ_{(i,j)∈E} (Sij − Si − Sj)

Obtain the Bethe partition function ZB at the global optimum:

−log ZB = min_{q∈L} F(q) = min_{q∈L} Eq(E) − SB(q(x))

13 / 46
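A minimal sketch of the Bethe entropy SB for a binary pairwise model, computed from singleton pseudo-marginals qi = p(Xi = 1) and 2x2 pairwise pseudo-marginals µij (the toy numbers below are my own locally consistent example):

```python
import numpy as np

def H(p):
    """Shannon entropy of a (pseudo-)marginal table, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def bethe_entropy(q, mu, edges):
    """S_B = sum_i S_i + sum_{(i,j) in E} (S_ij - S_i - S_j), binary pairwise model."""
    S_i = {i: H([1 - qi, qi]) for i, qi in q.items()}
    S = sum(S_i.values())
    for (i, j) in edges:
        S += H(mu[(i, j)]) - S_i[i] - S_i[j]
    return S

# Toy pseudo-marginals on a single edge (assumed values, locally consistent):
q = {0: 0.6, 1: 0.3}
mu = {(0, 1): np.array([[0.35, 0.05],      # rows: X0 = 0/1, cols: X1 = 0/1
                        [0.35, 0.25]])}
print(bethe_entropy(q, mu, edges=[(0, 1)]))
```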

Connection to LBP

Obtain the Bethe partition function ZB at the global optimum:

−log ZB = min_{q∈L} F = min_{q∈L} Eq(E) − SB(q(x))

   [marginal polytope (global consistency) relaxed to local polytope (local consistency)]

   - F is called the Bethe free energy (approximates the true free energy)
   - In a seminal paper, [YFW01] showed that fixed points of LBP correspond to stationary points of the Bethe free energy F
   - Refined by [Hes02]: stable fixed points correspond to local minima of F (converse not true in general)

14 / 46

Other methods to minimize the Bethe free energy F

LBP may be viewed as an algorithm to try to minimize F
   - But may not converge, or may converge only to a local minimum
   - Spurred much effort to find convergent algorithms such as:
     - Gradient methods [WT01]
     - Double loop methods, e.g. CCCP [Yui02] or [HAK03]
     - But still only to a local optimum, no time guarantee
   - For binary pairwise models:
     - Recent algorithm guaranteed to converge in polynomial time to an approximately stationary point of F [Shi12], restrictions on topology
     - Our algorithm guaranteed to return an ε-approximation to the global optimum [WJ14]
     - To our knowledge, no previously known methods guaranteed to return or approximate the global optimum

15 / 46

Binary pairwise MRFs

Main focus now on MRFs which are binary, i.e. all Xi ∈ {0, 1}, and pairwise, i.e. all potentials are over ≤ 2 variables:
   - n variables V = {X1, . . . , Xn}, singleton potentials ψi(xi)
   - x = (x1, . . . , xn) ∈ {0, 1}^n is one particular configuration
   - m edges (i, j) ∈ E ⊆ V × V, pairwise potentials ψij(xi, xj)

p(x) = (1/Z) exp( Σ_{i∈V} ψi(xi) + Σ_{(i,j)∈E} ψij(xi, xj) )

Can always reparameterize to a minimal representation {θi : i ∈ V}, {Wij : (i, j) ∈ E} s.t. same distribution:

p(x) = (1/Z) exp( Σ_{i∈V} θi xi + Σ_{(i,j)∈E} Wij xi xj )

16 / 46

Binary pairwise MRFs: simple example

p(x) = (1/Z) exp( Σ_{i∈V} ψi(xi) + Σ_{(i,j)∈E} ψij(xi, xj) )

Can always reparameterize to a minimal representation {θi : i ∈ V}, {Wij : (i, j) ∈ E} s.t. same distribution:

p(x) = (1/Z) exp( Σ_{i∈V} θi xi + Σ_{(i,j)∈E} Wij xi xj )

Example: two variables X1 — X2 with potentials

   ψ1(x1):   x1 = 0 → 2,   x1 = 1 → 4
   ψ2(x2):   x2 = 0 → −1,  x2 = 1 → −2
   ψ12(x1, x2):   x1\x2 |  0    1
                    0   |  1   −3
                    1   |  3    2

Minimal representation: local θ1 = 4, local θ2 = −5, edge W12 = 3
   - Wij > 0: attractive edge
   - Wij < 0: repulsive edge

17 / 46
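A small sketch of the reparameterization (the conversion formulas below are the standard ones, written out here as my own reconstruction rather than taken from the slides); applied to the example tables above it recovers θ1 = 4, θ2 = −5, W12 = 3 and the same distribution:

```python
import itertools
import math

# Potentials of the two-variable example above (lookup tables).
psi1 = {0: 2, 1: 4}
psi2 = {0: -1, 1: -2}
psi12 = {(0, 0): 1, (0, 1): -3, (1, 0): 3, (1, 1): 2}

# Conversion to the minimal representation (standard formulas, my reconstruction):
W12 = psi12[1, 1] - psi12[1, 0] - psi12[0, 1] + psi12[0, 0]
theta1 = (psi1[1] - psi1[0]) + (psi12[1, 0] - psi12[0, 0])
theta2 = (psi2[1] - psi2[0]) + (psi12[0, 1] - psi12[0, 0])
print(theta1, theta2, W12)        # -> 4 -5 3, matching the slide

# Check that both parameterizations give the same distribution.
def dist(score):
    s = {x: math.exp(score(*x)) for x in itertools.product([0, 1], repeat=2)}
    Z = sum(s.values())
    return {x: v / Z for x, v in s.items()}

orig = dist(lambda x1, x2: psi1[x1] + psi2[x2] + psi12[x1, x2])
mini = dist(lambda x1, x2: theta1 * x1 + theta2 * x2 + W12 * x1 * x2)
assert all(abs(orig[x] - mini[x]) < 1e-12 for x in orig)
```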

Bethe pseudo-marginals in the local polytope

−log ZB = min_{q∈L} F = min_{q∈L} Eq(E) − SB(q(x))

Must identify the q(x) ∈ L that minimizes F
   - q is defined by singleton pseudo-marginals qi = p(Xi = 1) ∀i ∈ V and pairwise µij ∀(i, j) ∈ E. The local polytope constraints imply

     µij = [ p(Xi=0, Xj=0)   p(Xi=0, Xj=1) ]  =  [ 1 + ξij − qi − qj    qj − ξij ]
           [ p(Xi=1, Xj=0)   p(Xi=1, Xj=1) ]     [ qi − ξij             ξij      ]

     with the constraint that all terms ≥ 0 ⇒ ξij ∈ [max(0, qi + qj − 1), min(qi, qj)]

[WT01] showed:
   - Minimizing F, can solve explicitly for ξij(qi, qj, Wij)
   - Here Wij is the associativity of the edge (as earlier)
   - Hence sufficient to search over (q1, . . . , qn) ∈ [0, 1]^n, but how?

18 / 46
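To make this concrete, a minimal sketch (my own, with toy model values) of evaluating the Bethe free energy F(q1, . . . , qn): for fixed q the pairwise terms decouple, so each ξij can be found by a 1-D optimization over its allowed interval, a numerical stand-in for the explicit [WT01] solution.

```python
import numpy as np
from scipy.optimize import minimize_scalar

theta = {0: 0.5, 1: -0.3, 2: 0.2}          # toy minimal parameterization (assumed values)
W = {(0, 1): 1.0, (1, 2): -0.8}

def ent(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

def edge_term(xi, qi, qj, Wij):
    """Edge contribution to F for a given xi = mu_ij(1,1): -W*xi - (S_ij - S_i - S_j)."""
    mu = np.array([[1 + xi - qi - qj, qj - xi], [qi - xi, xi]])
    return -Wij * xi - (ent(mu) - ent([qi, 1 - qi]) - ent([qj, 1 - qj]))

def bethe_F(q):
    """F(q) = E_q(E) - S_B(q), with each xi_ij optimized numerically on its interval."""
    F = sum(-theta[i] * q[i] - ent([q[i], 1 - q[i]]) for i in theta)   # singleton terms
    for (i, j), Wij in W.items():
        lo, hi = max(0.0, q[i] + q[j] - 1.0), min(q[i], q[j])
        res = minimize_scalar(edge_term, bounds=(lo, hi), method="bounded",
                              args=(q[i], q[j], Wij))
        F += res.fun
    return F

print(bethe_F({0: 0.6, 1: 0.5, 2: 0.4}))   # Bethe free energy at one pseudo-marginal point
```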

Our approach: a mesh over Bethe pseudo-marginals

We discretize the space (q1, . . . , qn) ∈ [0, 1]^n with a provably sufficient mesh M(ε), fine enough s.t. the optimum discretized point q* has F(q*) ≤ min_{q∈L} F(q) + ε.

[Figure: scatter plot of mesh points over (q1, q2, q3) ∈ [0, 1]^3]

19 / 46

Key ideas to approximate log ZB to within ε

Discretize to construct a provably sufficient mesh M(ε):
   - How guarantee F(q*) ≤ min_{q∈L} F(q) + ε?
   - How search the large discrete mesh efficiently?

Developed two approaches:
   - curvMesh: bounds curvature [WJ13]
   - gradMesh: bounds gradients - typically much better (orders of magnitude) [WJ14]

If original model attractive, i.e. Wij > 0 ∀(i, j) ∈ E (submodular cost functions), then show the discretized multi-label problem is submodular [WJ13, KKL12]
   - Hence, can be solved via graph cuts [SF06] in O(N^3), where N = Σ_{i∈V} Ni (Ni points in dim i)
   - Obtain FPTAS with gradMesh, N = O(nmW/ε)
   - To compare, for curvMesh, N = O( ε^{−1/2} n^{7/4} Δ^{3/4} exp( ½ (W(1 + Δ/2) + T) ) )

20 / 46
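As a toy illustration of the gradMesh idea (entirely my own sketch with a made-up surrogate objective, not the construction of [WJ14]): if each coordinate's partial derivative of the objective is bounded by L, a uniform grid with spacing 2ε/(nL) contains a point within ε of the continuous optimum, so exhaustive search over the grid yields an ε-approximation (only feasible directly for tiny n; the real method exploits structure such as submodularity to search efficiently).

```python
import numpy as np

# Toy surrogate objective on [0,1]^n standing in for the Bethe free energy F
# (made up for illustration; gradMesh uses problem-specific gradient bounds).
def f(Q):                                   # Q has shape (..., n)
    return np.sum((Q - 0.37) ** 2, axis=-1) + 0.1 * np.sum(np.sin(5 * Q), axis=-1)

n, eps = 3, 0.05
L = 2.0 + 0.5                               # bound on |df/dq_i| over [0,1] for this toy f
gamma = 2 * eps / (n * L)                   # rounding all coords changes f by <= n*L*gamma/2 = eps

grid_1d = np.arange(0.0, 1.0 + gamma, gamma)
Q = np.stack(np.meshgrid(*([grid_1d] * n), indexing="ij"), axis=-1).reshape(-1, n)
vals = f(Q)
best = Q[np.argmin(vals)]
print("points per dim:", len(grid_1d), " approx minimizer:", best, " value:", vals.min())
```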

Bounding the locations of stationary points

For general edge types (associative or repulsive), let
Wi = Σ_{j∈N(i): Wij>0} Wij,   Vi = −Σ_{j∈N(i): Wij<0} Wij

Theorem (WJ13)
At any stationary point of the Bethe free energy, σ(θi − Vi) ≤ qi ≤ σ(θi + Wi)

   - Developed an algorithm (Bethe bound propagation, BBP) that iteratively improves these bounds
   - [MK07] already had a similar algorithm, finds ranges of possible beliefs in LBP - a bit slower but typically better
   - Use this to preprocess a model to yield a smaller orthotope
     - reduces search space directly
     - for curvMesh, lowers max curvature, hence coarser mesh

21 / 46
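A small sketch of the first-pass box from the theorem (the model values are toy assumptions; I take σ to be the logistic sigmoid, and this is only the initial orthotope, not the iterative BBP refinement):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stationary_point_box(theta, W):
    """Per-variable bounds sigma(theta_i - V_i) <= q_i <= sigma(theta_i + W_i)."""
    box = {}
    for i, th in theta.items():
        Wi = sum(w for (a, b), w in W.items() if i in (a, b) and w > 0)
        Vi = -sum(w for (a, b), w in W.items() if i in (a, b) and w < 0)
        box[i] = (sigmoid(th - Vi), sigmoid(th + Wi))
    return box

# Toy model (assumed values): one attractive and one repulsive edge.
theta = {0: 0.5, 1: -0.3, 2: 0.2}
W = {(0, 1): 1.0, (1, 2): -0.8}
print(stationary_point_box(theta, W))
```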

