On the Bethe approximation

Adrian Weller
Department of Statistics at Oxford University
September 12, 2014

Joint work with Tony Jebara
Outline

1 Background on inference in graphical models
    Examples
    What are the problems and why are they interesting?
    Belief propagation (BP) as sum-product message passing
    Variational perspective on inference
2 Bethe approximation
    Link to BP
    Other methods to minimize the Bethe free energy
    New approach: discretize to obtain an ε-approx global optimum
        How discretize s.t. ε-approx guaranteed?
        How search efficiently over the discretized space?
    If time: Understanding the two aspects of the Bethe approximation (entropy and polytope), and new work on clamping...

Questions (anytime!)
Background

Focus on undirected probabilistic graphical models, also called Markov random fields (MRFs).
Compact, powerful way to model dependencies among variables.

Many applications, including:
- Systems biology (protein folding)
- Social network analysis (friends, politics, terrorism)
- Combinatorial problems (counting independent sets)
- Computer vision (image denoising, depth perception)
- Error-correcting codes (turbo codes, 3G/4G phones, satellite communication)
Example: image denoising

Inference is combining prior beliefs with observed evidence to form a prediction.

[Figure: noisy image −→ denoised image, via MAP inference]
Notation

- Focus on MRFs which are discrete and finite:
  n variables V = {X_1, . . . , X_n} and (log) potential functions ψ_c over subsets/factors c of V, c ∈ C ⊆ P(V), which give higher score to sub-configurations with higher compatibility.
- Write x = (x_1, . . . , x_n) ∈ X for one particular complete configuration and x_c for a configuration of the variables in c.
- ψ_c maps each setting x_c → ψ_c(x_c) ∈ R [lookup table]

    p(x) = (1/Z) exp( Σ_{c∈C} ψ_c(x_c) ) = e^{−E(x)} / Z,   with E(x) = − Σ_{c∈C} ψ_c(x_c),

where the partition function Z = Σ_{x∈X} exp( Σ_{c∈C} ψ_c(x_c) ) is the normalizing constant to ensure that probabilities sum to 1; E is the energy (negative score), cf. physics.
Inference: 3 key problems

    p(x) = (1/Z) exp( Σ_{c∈C} ψ_c(x_c) )

1 MAP inference: identify a configuration of variables with maximum probability, x* ∈ arg max_{x∈X} Σ_{c∈C} ψ_c(x_c)
2 Marginal inference: compute the probability distribution of a subset of variables x_c:
    p(x_c) = Σ_{x∈X: X_c=x_c} p(x) = Σ_{x∈X: X_c=x_c} exp( Σ_{c∈C} ψ_c(x_c) ) / Σ_{x∈X} exp( Σ_{c∈C} ψ_c(x_c) )
3 Evaluate the partition function, Z = Σ_{x∈X} exp( Σ_{c∈C} ψ_c(x_c) )

Great interest to find classes of problems and approaches such that exact or approximate inference is tractable.
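To pin the three problems down, the sketch below solves all of them by exhaustive enumeration on a tiny, hypothetical binary pairwise model (the parameters and variable names are made up for illustration). Enumeration is exponential in n, so this is only for intuition.

```python
import itertools
import math

# A tiny, hypothetical binary pairwise MRF in "score" form: theta[i] scores X_i = 1,
# W[(i, j)] scores X_i = X_j = 1. All numbers are made up for illustration.
theta = {0: 0.5, 1: -1.0, 2: 0.2}
W = {(0, 1): 1.5, (1, 2): -0.8}

def score(x):
    """Sum of potentials, i.e. the exponent in p(x) ∝ exp(score(x))."""
    s = sum(theta[i] * x[i] for i in theta)
    s += sum(w * x[i] * x[j] for (i, j), w in W.items())
    return s

configs = list(itertools.product([0, 1], repeat=len(theta)))

# Problem 3: partition function by exhaustive enumeration (feasible only for tiny n)
Z = sum(math.exp(score(x)) for x in configs)

# Problem 1: MAP inference -- the highest-scoring configuration
x_map = max(configs, key=score)

# Problem 2: marginal inference -- p(X_0 = 1) by summing out the other variables
p_x0 = sum(math.exp(score(x)) for x in configs if x[0] == 1) / Z

print("Z =", round(Z, 4), " MAP =", x_map, " p(X_0=1) =", round(p_x0, 4))
```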
Remark: conditioning on observed variables

    p(x) = (1/Z) exp( Σ_{c∈C} ψ_c(x_c) )

Suppose V is split into observed variables Y = y and unobserved variables X_U, so x = (x_u, y), x_u ∈ X_U.

    p(x_u | y) = p(x_u, y) / p(y) = p(x_u, y) / Σ_{x_u'∈X_U} p(x_u', y)

- This is just a new, smaller MRF with modified potentials on the variable set X_U
- New partition function to normalize the new distribution
- Hence the MRF framework is rich enough to handle conditioning
- When we discuss MRFs, they might or might not have been based on conditioning on variables
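As a sketch of the "new, smaller MRF" point (hypothetical pairwise model and evidence; the helper name condition is mine): observing X_k = y folds each pairwise potential touching X_k into the singleton potential of its other endpoint, and X_k is dropped.

```python
# Hypothetical pairwise model with lookup-table potentials psi_i(x_i), psi_ij(x_i, x_j).
psi = {0: {0: 0.0, 1: 0.5}, 1: {0: 0.0, 1: -1.0}, 2: {0: 0.0, 1: 0.2}}
psi_pair = {
    (0, 1): {(a, b): 1.5 * a * b for a in (0, 1) for b in (0, 1)},
    (1, 2): {(a, b): -0.8 * a * b for a in (0, 1) for b in (0, 1)},
}

def condition(psi, psi_pair, k, y):
    """Potentials of the smaller MRF over V \\ {k}, given evidence X_k = y."""
    new_psi = {i: dict(t) for i, t in psi.items() if i != k}
    new_pair = {}
    for (i, j), table in psi_pair.items():
        if k not in (i, j):
            new_pair[(i, j)] = table
        else:
            other = j if i == k else i
            for x in (0, 1):
                key = (y, x) if i == k else (x, y)   # slot y into X_k's position
                new_psi[other][x] += table[key]      # absorb into the singleton potential
    return new_psi, new_pair

# Condition on X_1 = 1: both edges disappear into modified singleton potentials,
# and the remaining model has its own (new) partition function.
print(condition(psi, psi_pair, k=1, y=1))
```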
Belief propagation (BP) for inference

Marginal inference via sum-product message passing:

- Send messages from variable v ∈ V to factor c ∈ C:

      m_{v→c}(x_v) = Π_{c*∈C(v)\{c}} m_{c*→v}(x_v)

- Send messages from factor c to variable v:

      m_{c→v}(x_v) = Σ_{x_c: (x_c)_v = x_v} φ_c(x_c) Π_{v*∈V(c)\{v}} m_{v*→c}(x_{v*}),

  where φ_c(x_c) = exp(ψ_c(x_c))
- For MAP inference, use max-product: switch Σ_{x_c} → max_{x_c}
- For acyclic models, converges to exact marginals efficiently (2 passes: collect leaves to root, then distribute)
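A minimal sketch of these updates, run synchronously on a tiny, hypothetical factor graph (two pairwise factors and one singleton factor over three binary variables; all names and numbers are made up). On this acyclic graph the resulting beliefs are the exact marginals; on a cyclic graph the same loop is loopy BP and may or may not converge.

```python
import itertools
import math
from collections import defaultdict

# Hypothetical factor graph: factor name -> (scope, table of phi_c = exp(psi_c))
factors = {
    "f01": ((0, 1), {(a, b): math.exp(0.8 * a * b) for a in (0, 1) for b in (0, 1)}),
    "f12": ((1, 2), {(a, b): math.exp(-0.5 * a * b) for a in (0, 1) for b in (0, 1)}),
    "f0":  ((0,),   {(a,): math.exp(0.3 * a) for a in (0, 1)}),
}
variables = {0, 1, 2}
nbrs = {v: [f for f, (scope, _) in factors.items() if v in scope] for v in variables}

msg = defaultdict(lambda: {0: 1.0, 1: 1.0})   # messages, initialised uniformly

for _ in range(10):   # plenty of sweeps for this tree; on loopy graphs this is LBP
    new = {}
    # variable -> factor: product of incoming factor messages, excluding the target
    for v in variables:
        for f in nbrs[v]:
            new[(v, f)] = {xv: math.prod(msg[(g, v)][xv] for g in nbrs[v] if g != f)
                           for xv in (0, 1)}
    # factor -> variable: sum out the factor's other variables
    for f, (scope, table) in factors.items():
        for v in scope:
            out = {0: 0.0, 1: 0.0}
            for xc in itertools.product((0, 1), repeat=len(scope)):
                term = table[xc]
                for u in scope:
                    if u != v:
                        term *= msg[(u, f)][xc[scope.index(u)]]
                out[xc[scope.index(v)]] += term
            new[(f, v)] = out
    msg.update(new)

# Beliefs: product of incoming factor -> variable messages, then normalise
for v in variables:
    b = {xv: math.prod(msg[(f, v)][xv] for f in nbrs[v]) for xv in (0, 1)}
    print(f"p(X_{v}=1) ≈ {b[1] / (b[0] + b[1]):.4f}")
```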
What about cyclic (loopy) models?

- Can triangulate and run junction tree
    Exact solution but takes time exponential in treewidth
- Or... just run loopy belief propagation (LBP) and hope
    Often produces strikingly good results
    But may not converge at all
- Extensive literature on trying to understand LBP
Inference: a variational perspective

Recall p(x) = (1/Z) exp( Σ_{c∈C} ψ_c(x_c) ) = e^{−E(x)} / Z

KL-divergence between some distribution q(x) and p(x) given by
    D(q||p) = Σ_x q(x) log [ q(x) / p(x) ] ≥ 0, equality iff q = p

Have
    0 ≤ D(q||p) = Σ_x q(x) log q(x) − Σ_x q(x) log p(x)
                = −S(q) − Σ_x q(x) [−E(x) − log Z]
                = E_q(E(x)) − S(q) + log Z,

where S(q) is the standard Shannon entropy of q.
Inference: a variational perspective

    0 ≤ D(q||p) = E_q(E(x)) − S(q) + log Z, equality iff q = p

- Hence E_q(E(x)) − S(q) ≥ − log Z
- This function of the distribution q is called the (Gibbs) free energy, F_G(q) = E_q(E(x)) − S(q)
- Minimizing it over all valid distributions q yields − log Z
- And the arg min is exactly when q = p, the true distribution
- Hence can think of inference as optimization
- But still intractable in general...

END OF PART I
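A quick numerical sanity check of this identity on a tiny, hypothetical two-variable model (brute force, for intuition only): F_G(p) recovers −log Z exactly, and any other distribution q gives a strictly larger value.

```python
import itertools
import math
import random

# Numerical check of 0 <= D(q||p) = E_q[E(x)] - S(q) + log Z on a tiny,
# hypothetical two-variable binary model (all numbers made up).
theta = {0: 0.4, 1: -0.7}
W = {(0, 1): 1.2}

def energy(x):
    """E(x) = minus the sum of potentials."""
    return -(sum(theta[i] * x[i] for i in theta) +
             sum(w * x[i] * x[j] for (i, j), w in W.items()))

configs = list(itertools.product([0, 1], repeat=2))
Z = sum(math.exp(-energy(x)) for x in configs)
p = {x: math.exp(-energy(x)) / Z for x in configs}

def gibbs_free_energy(q):
    avg_energy = sum(q[x] * energy(x) for x in configs)
    entropy = -sum(q[x] * math.log(q[x]) for x in configs if q[x] > 0)
    return avg_energy - entropy

print("-log Z  =", round(-math.log(Z), 6))
print("F_G(p)  =", round(gibbs_free_energy(p), 6))   # equals -log Z

# Any other distribution gives a strictly larger free energy
random.seed(0)
w = [random.random() for _ in configs]
q = {x: wi / sum(w) for x, wi in zip(configs, w)}
print("F_G(q)  =", round(gibbs_free_energy(q), 6))   # > -log Z
```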
Part II: Bethe approximation

- Seek to approximate the partition function Z
- Also interested in approximate marginal inference (medical diagnosis, power network)

The Bethe approximation: what and why?
- Introduced by Hans Bethe in the 1930s to study phase transitions in statistical physics. Wikipedia:
    Bethe left Germany in 1933, moving to England after receiving an offer as lecturer. ... He moved in with his friend Rudolf Peierls... This meant that Bethe had someone to speak to in German, and did not have to eat English food.
- Found fresh application in machine learning
- Direct connections to variational inference and belief propagation [YFW01]
Recall variational approach

    − log Z = min_{q∈M} F_G(q) = min_{q∈M} E_q(E) − S(q(x))

M is the marginal polytope, which comprises all globally valid probability distributions over all the variables, i.e. the convex hull of all 2^n configurations (for binary variables).
F_G is the Gibbs free energy, optimum at the true distribution.

Bethe approximation has 2 aspects, both pairwise approximations:
1 Relax the marginal polytope M to the local polytope L, which enforces only pairwise consistency, hence pseudo-marginals
2 Use Bethe entropy S_B = Σ_{i∈V} S_i + Σ_{(i,j)∈E} (S_ij − S_i − S_j)

Obtain Bethe partition function Z_B at the global optimum:

    − log Z_B = min_{q∈L} F(q) = min_{q∈L} E_q(E) − S_B(q(x))
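A minimal sketch of the two ingredients on a tiny, hypothetical two-variable pairwise model: given singleton pseudo-marginals q_i and a locally consistent pairwise table μ_ij (all numbers made up), evaluate the average energy and the Bethe entropy, hence F(q).

```python
import math

def H(dist):
    """Shannon entropy of a small probability table given as a list of values."""
    return -sum(p * math.log(p) for p in dist if p > 1e-12)

# Hypothetical two-variable pairwise model with lookup-table potentials
psi = {0: {0: 0.0, 1: 0.4}, 1: {0: 0.0, 1: -0.3}}
psi_pair = {(0, 1): {(a, b): 0.9 * a * b for a in (0, 1) for b in (0, 1)}}

# A point in the local polytope: singleton pseudo-marginals q_i = p(X_i = 1) and a
# pairwise table mu_ij whose row/column sums match the singletons (made-up values).
q = {0: 0.6, 1: 0.5}
mu = {(0, 1): {(0, 0): 0.25, (0, 1): 0.15, (1, 0): 0.25, (1, 1): 0.35}}

def bethe_free_energy(q, mu):
    # Average energy E_q(E), with E(x) = -(sum of potentials)
    avg_E = -sum((1 - q[i]) * psi[i][0] + q[i] * psi[i][1] for i in psi)
    avg_E -= sum(mu[e][k] * psi_pair[e][k] for e in psi_pair for k in mu[e])
    # Bethe entropy: S_B = sum_i S_i + sum_{(i,j)} (S_ij - S_i - S_j)
    S = {i: H([1 - q[i], q[i]]) for i in q}
    S_B = sum(S.values())
    for (i, j), table in mu.items():
        S_B += H(list(table.values())) - S[i] - S[j]
    return avg_E - S_B

print("F(q, mu) =", round(bethe_free_energy(q, mu), 4))
```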
Connection to LBP

Obtain Bethe partition function Z_B at the global optimum:

    − log Z_B = min_{q∈L} F = min_{q∈L} E_q(E) − S_B(q(x))

    [cf. marginal polytope (global consistency) vs local polytope (local consistency)]

- F is called the Bethe free energy (approximates the true free energy)
- In a seminal paper, [YFW01] showed that fixed points of LBP correspond to stationary points of the Bethe free energy F
- Refined by [Hes02]: stable fixed points correspond to local minima of F (converse not true in general)
Other methods to minimize Bethe free energy F

LBP may be viewed as an algorithm to try to minimize F
- But may not converge, or may converge only to a local minimum

Spurred much effort to find convergent algorithms such as
- Gradient methods [WT01]
- Double loop methods, e.g. CCCP [Yui02] or [HAK03]
- But still only to a local optimum, no time guarantee

For binary pairwise models
- Recent algorithm guaranteed to converge in polynomial time to an approximately stationary point of F [Shi12], restrictions on topology
- Our algorithm guaranteed to return an ε-approximation to the global optimum [WJ14]
- To our knowledge, no previously known methods guaranteed to return or approximate the global optimum
Binary pairwise MRFs

Main focus now on MRFs which are binary, i.e. all X_i ∈ {0, 1}, and pairwise, i.e. all potentials are over ≤ 2 variables:
- n variables V = {X_1, . . . , X_n}, singleton potentials ψ_i(x_i)
- x = (x_1, . . . , x_n) ∈ {0, 1}^n is one particular configuration
- m edges (i, j) ∈ E ⊆ V × V, pairwise potentials ψ_ij(x_i, x_j)

    p(x) = (1/Z) exp( Σ_{i∈V} ψ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) )

Can always reparameterize to a minimal representation {θ_i : i ∈ V}, {W_ij : (i,j) ∈ E} s.t. same distribution:

    p(x) = (1/Z) exp( Σ_{i∈V} θ_i x_i + Σ_{(i,j)∈E} W_ij x_i x_j )
Binary pairwise MRFs: simple example

    p(x) = (1/Z) exp( Σ_{i∈V} ψ_i(x_i) + Σ_{(i,j)∈E} ψ_ij(x_i, x_j) )

Can always reparameterize to a minimal representation {θ_i : i ∈ V}, {W_ij : (i,j) ∈ E} s.t. same distribution:

    p(x) = (1/Z) exp( Σ_{i∈V} θ_i x_i + Σ_{(i,j)∈E} W_ij x_i x_j )

Example: two variables X_1 —— X_2 with a single edge

    ψ_1(x_1):   x_1 = 0 → 2,    x_1 = 1 → 4
    ψ_2(x_2):   x_2 = 0 → −1,   x_2 = 1 → −2
    ψ_12(x_1, x_2):   x_1\x_2    0    1
                         0       1   −3
                         1       3    2

Reparameterized: local θ_1 = 4, local θ_2 = −5, edge W_12 = 3
(W_ij > 0 attractive, W_ij < 0 repulsive)
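A sketch reproducing the example's minimal representation from the potential tables. The particular offset convention below (fold in the edge's contribution when the other endpoint is 0) is one standard choice; the leftover constant offset is absorbed into Z, so the distribution is unchanged, which the code checks directly.

```python
import math

# Potential tables from the slide's two-variable example
psi1 = {0: 2, 1: 4}
psi2 = {0: -1, 1: -2}
psi12 = {(0, 0): 1, (0, 1): -3, (1, 0): 3, (1, 1): 2}

# Edge coupling: W = psi(1,1) + psi(0,0) - psi(0,1) - psi(1,0)
W12 = psi12[(1, 1)] + psi12[(0, 0)] - psi12[(0, 1)] - psi12[(1, 0)]

# Singleton fields: difference of singleton potentials plus the edge's
# contribution when the other endpoint is 0
theta1 = (psi1[1] - psi1[0]) + (psi12[(1, 0)] - psi12[(0, 0)])
theta2 = (psi2[1] - psi2[0]) + (psi12[(0, 1)] - psi12[(0, 0)])

print(theta1, theta2, W12)  # 4 -5 3, matching the slide

# Sanity check: both parameterizations give the same distribution
def p_orig(x1, x2):
    return math.exp(psi1[x1] + psi2[x2] + psi12[(x1, x2)])

def p_min(x1, x2):
    return math.exp(theta1 * x1 + theta2 * x2 + W12 * x1 * x2)

Z_orig = sum(p_orig(a, b) for a in (0, 1) for b in (0, 1))
Z_min = sum(p_min(a, b) for a in (0, 1) for b in (0, 1))
for a in (0, 1):
    for b in (0, 1):
        assert abs(p_orig(a, b) / Z_orig - p_min(a, b) / Z_min) < 1e-12
```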
Bethe pseudo-marginals in the local polytope

    − log Z_B = min_{q∈L} F = min_{q∈L} E_q(E) − S_B(q(x))

Must identify q(x) ∈ L that minimizes F.
q is defined by singleton pseudo-marginals q_i = p(X_i = 1) ∀i ∈ V and pairwise µ_ij ∀(i, j) ∈ E. Local polytope constraints imply

    µ_ij = [ p(X_i=0, X_j=0)   p(X_i=0, X_j=1) ]   =   [ 1 + ξ_ij − q_i − q_j     q_j − ξ_ij ]
           [ p(X_i=1, X_j=0)   p(X_i=1, X_j=1) ]       [ q_i − ξ_ij               ξ_ij       ]

with the constraint that all terms ≥ 0 ⇒ ξ_ij ∈ [max(0, q_i + q_j − 1), min(q_i, q_j)].

[WT01] showed:
- Minimizing F, can solve explicitly for ξ_ij(q_i, q_j, W_ij)
- Here W_ij is the associativity of the edge (as earlier)

Hence sufficient to search over (q_1, . . . , q_n) ∈ [0, 1]^n, but how?
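A sketch of that explicit solution as I recall it from [WT01] (treat the exact form as my assumption rather than a quote from the slides): setting ∂F/∂ξ_ij = 0 with q_i, q_j fixed gives e^{W_ij} = ξ(1 + ξ − q_i − q_j) / ((q_i − ξ)(q_j − ξ)), i.e. the quadratic α ξ² − (1 + α(q_i + q_j)) ξ + (1 + α) q_i q_j = 0 with α = e^{W_ij} − 1, whose valid root lies in the interval above.

```python
import math

def xi_opt(qi, qj, W):
    """Pairwise term xi_ij minimizing the Bethe free energy for fixed qi, qj
    (a sketch of the [WT01] closed form as I recall it)."""
    if abs(W) < 1e-12:
        return qi * qj                    # no coupling: independent case
    alpha = math.exp(W) - 1.0
    b = 1.0 + alpha * (qi + qj)
    # Valid root of: alpha*xi^2 - b*xi + (1+alpha)*qi*qj = 0
    disc = b * b - 4.0 * alpha * (1.0 + alpha) * qi * qj
    xi = (b - math.sqrt(disc)) / (2.0 * alpha)
    # Should land inside the local-polytope interval from the slide
    lo, hi = max(0.0, qi + qj - 1.0), min(qi, qj)
    assert lo - 1e-9 <= xi <= hi + 1e-9
    return min(max(xi, lo), hi)

# Quick check: an attractive edge pulls xi above the independent value qi*qj,
# a repulsive edge pushes it below.
print(xi_opt(0.6, 0.7, 2.0), 0.6 * 0.7, xi_opt(0.6, 0.7, -2.0))
```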
Our approach: a mesh over Bethe pseudo-marginals

We discretize the space (q_1, . . . , q_n) ∈ [0, 1]^n with a provably sufficient mesh M(ε), fine enough s.t. the optimum discretized point q∗ has F(q∗) ≤ min_{q∈L} F(q) + ε.

[Figure: a discrete mesh of points in the unit cube over (q_1, q_2, q_3)]
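To make the idea concrete, here is a brute-force illustration of a mesh search on a tiny, hypothetical 3-variable model (my own toy uniform grid, not the curvMesh/gradMesh constructions from [WJ13, WJ14]): evaluate the Bethe free energy at every grid point, using the closed-form ξ_ij from the previous sketch, and keep the minimum as an estimate of −log Z_B; the exact −log Z is shown for comparison.

```python
import itertools
import math

# Tiny, hypothetical binary pairwise model in minimal (theta, W) form
theta = {0: 0.3, 1: -0.4, 2: 0.1}
edges = {(0, 1): 1.0, (1, 2): -0.6}

def xi_opt(qi, qj, W):
    """Closed-form pairwise term minimizing F for fixed qi, qj (see previous sketch)."""
    if abs(W) < 1e-12:
        return qi * qj
    a = math.exp(W) - 1.0
    b = 1.0 + a * (qi + qj)
    xi = (b - math.sqrt(b * b - 4.0 * a * (1.0 + a) * qi * qj)) / (2.0 * a)
    return min(max(xi, max(0.0, qi + qj - 1.0)), min(qi, qj))

def H(dist):
    return -sum(p * math.log(p) for p in dist if p > 1e-12)

def bethe_F(q):
    energy = -sum(theta[i] * q[i] for i in theta)           # singleton part of E_q(E)
    S_B = sum(H([q[i], 1 - q[i]]) for i in theta)
    for (i, j), W in edges.items():
        xi = xi_opt(q[i], q[j], W)
        energy -= W * xi                                     # pairwise part of E_q(E)
        mu = [1 + xi - q[i] - q[j], q[j] - xi, q[i] - xi, xi]
        S_B += H(mu) - H([q[i], 1 - q[i]]) - H([q[j], 1 - q[j]])
    return energy - S_B

# Uniform mesh over (q_1, ..., q_n); a finer grid gives a tighter approximation
grid = [k / 20 for k in range(1, 20)]
best = min(bethe_F(dict(zip(theta, qs)))
           for qs in itertools.product(grid, repeat=len(theta)))

# Exact -log Z by enumeration, for comparison on this tiny model
Z = sum(math.exp(sum(theta[i] * x[i] for i in theta) +
                 sum(W * x[i] * x[j] for (i, j), W in edges.items()))
        for x in itertools.product((0, 1), repeat=len(theta)))
print("mesh estimate of -log Z_B ≈", round(best, 4),
      "   exact -log Z =", round(-math.log(Z), 4))
```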
Key ideas to approximate log Z_B to within ε

Discretize to construct a provably sufficient mesh M(ε):
- How guarantee F(q∗) ≤ min_{q∈L} F(q) + ε?
- How search the large discrete mesh efficiently?

Developed two approaches:
- curvMesh bounds curvature [WJ13]
- gradMesh bounds gradients - typically much better (orders of magnitude) [WJ14]

If the original model is attractive, i.e. W_ij > 0 ∀(i,j) ∈ E (submodular cost functions), then show the discretized multi-label problem is submodular [WJ13, KKL12]
- Hence, can be solved via graph cuts [SF06] in O(N^3), where N = Σ_{i∈V} N_i with N_i points in dim i [cf. Π_{i∈V} N_i]
- Obtain FPTAS with gradMesh, N = O(nmW/ε)
- To compare, for curvMesh, N = O( ε^{−1/2} n^{7/4} ∆^{3/4} exp( ½ (W(1 + ∆/2) + T) ) )
Bounding the locations of stationary points

For general edge types (associative or repulsive), let

    W_i = Σ_{j∈N(i): W_ij > 0} W_ij ,     V_i = − Σ_{j∈N(i): W_ij < 0} W_ij

Theorem (WJ13)
At any stationary point of the Bethe free energy,
    σ(θ_i − V_i) ≤ q_i ≤ σ(θ_i + W_i)

- Developed an algorithm (Bethe bound propagation, BBP) that iteratively improves these bounds
- [MK07] already had a similar algorithm, finds ranges of possible beliefs in LBP - a bit slower but typically better
- Use this to preprocess models to yield a smaller orthotope
    reduces search space directly
    for curvMesh, lowers max curvature, hence coarser mesh
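A minimal sketch of the first-pass box from the theorem on a hypothetical model (numbers made up; σ is taken here to be the logistic sigmoid, which is how I read the statement).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical small model in minimal (theta, W) form with mixed edge signs
theta = {0: 0.5, 1: -1.0, 2: 0.2}
edges = {(0, 1): 1.5, (1, 2): -0.8}

def incident_weights(i):
    return [w for (a, b), w in edges.items() if i in (a, b)]

# First-pass box (orthotope) containing all stationary points of the Bethe free
# energy, per the theorem above: sigma(theta_i - V_i) <= q_i <= sigma(theta_i + W_i)
for i, th in theta.items():
    Wi = sum(w for w in incident_weights(i) if w > 0)    # total attractive weight at i
    Vi = -sum(w for w in incident_weights(i) if w < 0)   # total repulsive weight at i
    print(f"q_{i} in [{sigmoid(th - Vi):.3f}, {sigmoid(th + Wi):.3f}]")
```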