class: middle, center, title-slide
Lecture: Inference in Bayesian networks
Prof. Gilles Louppe
[email protected]
???
Reorganize: use BNs as generators by sampling.
Discuss the sampling algorithms with code, to better communicate the inefficiency and intuition => All of this could be done with a notebook.
Skip the entire lecture?
.grid[ .kol-1-2[
- Exact inference
- Inference by enumeration
- Inference by variable elimination
- Approximate inference
.footnote[Image credits: CS188, UC Berkeley.]
.grid[ .kol-2-3[ A Bayesian network is a directed acyclic graph in which:
- Each node corresponds to a random variable $X_i$.
- Each node $X_i$ is annotated with a conditional probability distribution ${\bf P}(X_i | \text{parents}(X_i))$ that quantifies the effect of the parents on the node.
A Bayesian network implicitly encodes the full joint distribution as the product of the local distributions:
]
.footnote[Image credits: CS188, UC Berkeley.]
???
Reminder: the topology encodes conditional independence assumptions!
class: middle
.grid[
.kol-3-4[.center.width-100[]]
.kol-1-4[.center.width-100[
]]
]
$$ \begin{aligned} P(b,\lnot e, a, \lnot j, m) &= P(b)P(\lnot e)P(a|b, \lnot e)P(\lnot j|a)P(m|a) \\\\ &= 0.001 \times 0.998 \times 0.94 \times 0.1 \times 0.7 \end{aligned}$$
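As a sanity check, this product of CPT entries can be evaluated directly; a minimal sketch, assuming the values shown above:

```python
# Evaluating one full joint entry of the alarm network as a product
# of local CPT entries (values taken from the slide above).
p_b = 0.001      # P(b)
p_not_e = 0.998  # P(¬e)
p_a = 0.94       # P(a | b, ¬e)
p_not_j = 0.1    # P(¬j | a)
p_m = 0.7        # P(m | a)

joint = p_b * p_not_e * p_a * p_not_j * p_m
print(joint)  # ≈ 6.57e-05
```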
class: middle
Inference is concerned with the problem of computing a marginal and/or a conditional probability distribution from a joint probability distribution:
.grid[ .kol-1-3.center[Simple queries:] .kol-2-3[${\bf P}(X_i|e)$] ] .grid[ .kol-1-3.center[Conjunctive queries:] .kol-2-3[${\bf P}(X_i,X_j|e)={\bf P}(X_i|e){\bf P}(X_j|X_i,e)$] ] .grid[ .kol-1-3.center[Most likely explanation:] .kol-2-3[$\arg \max_q P(q|e)$] ] .grid[ .kol-1-3.center[Optimal decisions:] .kol-2-3[$\arg \max\_a \mathbb{E}_{p(s'|s,a)} \left[ V(s') \right]$] ]
.footnote[Image credits: CS188, UC Berkeley.]
???
Explain what
Insist on the importance of inference. Inference <=> reasoning.
Start from the joint distribution
- Select the entries consistent with the evidence $E_1, ..., E_k = e_1, ..., e_k$.
- Marginalize out the hidden variables to obtain the joint of the query and the evidence variables:
$${\bf P}(Q,e_1,...,e_k) = \sum_{h_1, ..., h_r} {\bf P}(Q, h_1, ..., h_r, e_1, ..., e_k).$$
- Normalize:
$$\begin{aligned} Z &= \sum_q P(q,e_1,...,e_k) \\\\ {\bf P}(Q|e_1, ..., e_k) &= \frac{1}{Z} {\bf P}(Q,e_1,...,e_k) \end{aligned}$$
class: middle
Consider the alarm network and the query
$\begin{aligned}
{\bf P}(B|j,m) &= \frac{1}{Z} \sum_e \sum_a {\bf P}(B,j,m,e,a) \\
&\propto \sum_e \sum_a {\bf P}(B,j,m,e,a)
\end{aligned}$
Using the Bayesian network, the full joint entries can be rewritten as the product of CPT entries:
$\begin{aligned}
{\bf P}(B|j,m) &\propto \sum_e \sum_a {\bf P}(B)P(e){\bf P}(a|B,e)P(j|a)P(m|a)
\end{aligned}$
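As an illustration, inference by enumeration for this query can be sketched in a few lines of Python. The CPT values below are the standard alarm-network ones and are an assumption here.

```python
# Inference by enumeration for the alarm-network query P(B | j, m).
# CPT values are the standard alarm-network ones (an assumption here).
P_b = {True: 0.001, False: 0.999}
P_e = {True: 0.002, False: 0.998}
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a=1 | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j=1 | A)
P_m = {True: 0.70, False: 0.01}                      # P(m=1 | A)

def joint(b, e, a, j, m):
    """Full joint entry as the product of local CPT entries."""
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    pj = P_j[a] if j else 1 - P_j[a]
    pm = P_m[a] if m else 1 - P_m[a]
    return P_b[b] * P_e[e] * pa * pj * pm

# Sum out the hidden variables E and A, then normalize over B.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e in (True, False) for a in (True, False))
          for b in (True, False)}
Z = sum(unnorm.values())
posterior = {b: p / Z for b, p in unnorm.items()}
print(posterior[True])  # ≈ 0.284
```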
???
&\propto P(B) \sum_e P(e) \sum_a P(a|B,e)P(j|a)P(m|a)
class: middle
Inference by enumeration is slow because the full joint distribution is built up before the hidden variables are summed out.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Notice that factors that do not depend on the variables in the summations can be factored out, which means that marginalization does not necessarily have to be done at the end:
class: middle
Same complexity as DFS:
???
- $n$ is the number of variables.
- $d$ is the size of their domain.
class: middle
Enumeration is still inefficient: there are repeated computations!
- e.g., $P(j|a)P(m|a)$ is computed twice, once for $e$ and once for $\lnot e$.
- These can be avoided by storing intermediate results.
???
Inefficient because the product is evaluated left-to-right, in a DFS manner.
The variable elimination (VE) algorithm carries out summations right-to-left and stores intermediate results (called factors) to avoid recomputations. The algorithm interleaves:
- Joining sub-tables
- Eliminating hidden variables
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
class: middle
- Each factor $\mathbf{f}_i$ is a multi-dimensional array indexed by the values of its argument variables. E.g.: .grid[ .kol-1-2[ $$ \begin{aligned} \mathbf{f}_4 &= \mathbf{f}_4(A) = \left(\begin{matrix} P(j|a) \\ P(j|\lnot a) \end{matrix}\right) = \left(\begin{matrix} 0.90 \\ 0.05 \end{matrix}\right) \\ \mathbf{f}_4(a) &= 0.90 \\ \mathbf{f}_4(\lnot a) &= 0.05 \end{aligned}$$ ] ]
- Factors are initialized with the CPTs annotating the nodes of the Bayesian network, conditioned on the evidence.
class: middle
The pointwise product
- Exactly like a database join!
- The variables of $\mathbf{f}_3$ are the union of the variables in $\mathbf{f}_1$ and $\mathbf{f}_2$.
- The elements of $\mathbf{f}_3$ are given by the product of the corresponding elements in $\mathbf{f}_1$ and $\mathbf{f}_2$.
class: middle
Summing out, or eliminating, a variable from a factor is done by adding up the sub-arrays formed by fixing the variable to each of its values in turn.
For example, to sum out
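The pointwise product and the sum-out operation can be sketched with numpy arrays; the axis convention (index 0 = true, index 1 = false) and the CPT values below are assumptions.

```python
import numpy as np

# Factors as numpy arrays indexed by their variables' values
# (assumed convention: index 0 = true, index 1 = false).
# f1(A, J) = P(J | A) and f2(A, M) = P(M | A), standard alarm values.
f1 = np.array([[0.90, 0.10],    # P(j | a),  P(¬j | a)
               [0.05, 0.95]])   # P(j | ¬a), P(¬j | ¬a)
f2 = np.array([[0.70, 0.30],    # P(m | a),  P(¬m | a)
               [0.01, 0.99]])   # P(m | ¬a), P(¬m | ¬a)

# Pointwise product: f3(A, J, M) = f1(A, J) * f2(A, M), via broadcasting.
f3 = f1[:, :, None] * f2[:, None, :]

# Summing out A: add the sub-arrays obtained by fixing A to each value.
f_jm = f3.sum(axis=0)
print(f_jm[0, 0])  # f(j, m) = 0.9*0.7 + 0.05*0.01 = 0.6305
```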
class: middle
Query:
- Start with the initial factors (the local CPTs, instantiated by the evidence).
- While there are still hidden variables:
    - Pick a hidden variable $H$.
    - Join all factors mentioning $H$.
    - Eliminate $H$.
- Join all remaining factors
- Normalize
exclude: true class: middle, center
(blackboard example)
???
Prepare this for
class: middle
Consider the query
- $\sum_m P(m|a) = 1$, therefore $M$ is irrelevant for the query.
- In other words, ${\bf P}(J|b)$ remains unchanged if we remove $M$ from the network.
.italic[Theorem.]
class: middle
Consider the query
Work through the two elimination orderings:
- $Z, X_1, ..., X_{n-1}$
- $X_1, ..., X_{n-1}, Z$

What is the size of the maximum factor generated for each of the orderings?
- Answer: $2^{n+1}$ vs. $2^2$ (assuming boolean values).
class: middle
The computational and space complexity of variable elimination is determined by the largest factor.
- The elimination ordering can greatly affect the size of the largest factor.
- Does there always exist an ordering that only results in small factors? No!
- Greedy heuristic: eliminate whichever variable minimizes the size of the factor to be constructed.
- Singly connected networks (polytrees):
    - Any two nodes are connected by at most one (undirected) path.
    - For these networks, the time and space complexity of variable elimination are $O(nd^k)$.
class: middle
3SAT is a special case of inference:
- CSP: $(u_1 \lor u_2 \lor u_3) \wedge (\lnot u_1 \lor \lnot u_2 \lor u_3) \wedge (u_2 \lor \lnot u_3 \lor u_4)$
- $P(U_i=0)=P(U_i=1)=0.5$
- $C_1 = U_1 \lor U_2 \lor U_3$; $C_2 = \lnot U_1 \lor \lnot U_2 \lor U_3$; $C_3 = U_2 \lor \lnot U_3 \lor U_4$
- $D_1 = C_1$; $D_2 = D_1 \wedge C_2$; $Y = D_2 \wedge C_3$
???
SKIP since Lecture 2b is optional.
3SAT is the problem of determining the satisfiability of a formula in conjunctive normal form, where each clause is limited to at most three literals.
class: middle
If we can answer whether
- By reduction, inference in Bayesian networks is therefore NP-complete.
- There is no known efficient probabilistic inference algorithm in general.
???
Proof by reduction: transforming a problem (here inference) into another (here checking the satisfiability of a formula).
class: middle
class: middle
Exact inference is intractable for most probabilistic models of practical interest (e.g., models involving many variables, both continuous and discrete, or undirected cycles).
.center.width-80[]
.center.width-100[
]
class: middle
Abandon exact inference and develop approximate but faster inference algorithms:
- Sampling methods: produce answers by repeatedly generating random numbers from a distribution of interest.
- Variational methods: formulate inference as an optimization problem.
- Belief propagation methods: formulate inference as a message-passing algorithm.
- Machine learning methods: learn an approximation of the target distribution from training examples.
Basic idea:
- Draw $N$ samples from a sampling distribution $S$.
- Compute an approximate posterior probability $\hat{P}$.
- Show that this estimate converges to the true probability distribution $P$.
Generating samples is often much faster than computing the right answer (e.g., with variable elimination).
.footnote[Image credits: CS188, UC Berkeley.]
How to sample from the distribution of a discrete variable?
- Assume $k$ discrete outcomes $x_1, ..., x_k$ with probability $P(x_i)$.
- Assume sampling from the uniform $\mathcal{U}(0,1)$ is possible.
    - e.g., as enabled by a standard rand() function.
- Divide the $[0,1]$ interval into $k$ regions, with region $i$ having size $P(x_i)$.
- Sample $u \sim \mathcal{U}(0,1)$ and return the value associated to the region in which $u$ falls.
.center.width-60[![](figures/archives-lec-inference/sampling.png)]
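This procedure can be sketched in Python; the helper name `sample_discrete` is hypothetical.

```python
import random

# Divide [0, 1] into regions of size P(x_i) and return the outcome
# whose region contains u ~ U(0, 1).
def sample_discrete(outcomes, probs):
    u = random.random()        # u ~ U(0, 1)
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if u < cumulative:
            return x
    return outcomes[-1]        # guard against floating-point round-off

# e.g., C in {red, green, blue} with P = (0.6, 0.1, 0.3)
samples = [sample_discrete(["red", "green", "blue"], [0.6, 0.1, 0.3])
           for _ in range(10000)]
print(samples.count("red") / len(samples))  # ≈ 0.6
```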
class: middle
.grid[
.kol-1-3.center[
]
.kol-2-3[
$$\begin{aligned}
0 \leq u < 0.6 &\to C = \text{red} \\
0.6 \leq u < 0.7 &\to C = \text{green} \\
0.7 \leq u < 1 &\to C = \text{blue} \\
\end{aligned}$$
]
]
.footnote[Image credits: CS188, UC Berkeley.]
Sampling from a Bayesian network, without observed evidence:
- Sample each variable in turn, in topological order.
- The probability distribution from which the value is sampled is conditioned on the values already assigned to the variable's parents.
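A minimal prior-sampling sketch for the alarm network; the CPT values below are the standard ones and are an assumption here.

```python
import random

# Prior (ancestral) sampling for the alarm network.
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def prior_sample():
    # Sample each variable in topological order, conditioned on its parents.
    b = random.random() < 0.001                  # P(b)
    e = random.random() < 0.002                  # P(e)
    a = random.random() < P_a[(b, e)]            # P(a | b, e)
    j = random.random() < (0.90 if a else 0.05)  # P(j | a)
    m = random.random() < (0.70 if a else 0.01)  # P(m | a)
    return b, e, a, j, m

# Estimate P(j) by simple counting over N samples.
N = 100000
estimate = sum(prior_sample()[3] for _ in range(N)) / N
print(estimate)  # ≈ 0.052
```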
class: middle
.footnote[Image credits: CS188, UC Berkeley.]
???
Comment on the cartoon:
- sample first a shape
- then a color given the shape
class: middle
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle
We will collect a bunch of samples from the Bayesian network:
If we want to know ${\bf P}(W)$:
- We have counts $\langle w:4, \lnot w:1 \rangle$.
- Normalize to obtain $\hat{{\bf P}}(W) = \langle w:0.8, \lnot w:0.2 \rangle$.
- $\hat{{\bf P}}(W)$ will get closer to the true distribution ${\bf P}(W)$ as we generate more samples.
class: middle
The probability that prior sampling generates a particular event is
Let
class: middle
Then,
$$\begin{aligned}
\lim_{N \to \infty} \hat{P}(x_1,...,x_n) &= \lim_{N \to \infty} N_\text{PS}(x_1, ..., x_n) / N \\
&= S_\text{PS}(x_1, ..., x_n) \\
&= P(x_1, ..., x_n)
\end{aligned}$$
Therefore, prior sampling is consistent:
Using prior sampling, an estimate
.footnote[Image credits: CS188, UC Berkeley.]
???
Cartoon:
- reject all samples which are not blue (the evidence)
class: middle
class: middle
Let us consider the posterior probability estimator
Therefore, rejection sampling is consistent.
- The standard deviation of the error in each probability is $O(1/\sqrt{n})$, where $n$ is the number of samples used to compute the estimate.
- Problem: many samples are rejected!
    - Hopelessly expensive if the evidence is unlikely, i.e. if $P(e)$ is small.
    - Evidence is not exploited when sampling.
???
From the prior sampling analysis, we know that N_PS / N ---> P.
Idea: clamp the evidence variables, sample the rest.
- Problem: the resulting sampling distribution is not consistent.
- Solution: weight by probability of evidence given parents.
.footnote[Image credits: CS188, UC Berkeley.]
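The idea can be sketched for the query $P(b|j,m)$ on the alarm network (standard CPT values assumed): evidence variables are clamped, and each sample is weighted by the probability of the evidence given its sampled parents.

```python
import random

# Likelihood weighting for P(b | j, m) on the alarm network.
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def weighted_sample():
    w = 1.0
    b = random.random() < 0.001
    e = random.random() < 0.002
    a = random.random() < P_a[(b, e)]
    w *= 0.90 if a else 0.05   # evidence J = j is clamped: weight by P(j | a)
    w *= 0.70 if a else 0.01   # evidence M = m is clamped: weight by P(m | a)
    return b, w

num, den = 0.0, 0.0
for _ in range(1000000):
    b, w = weighted_sample()
    den += w
    if b:
        num += w

estimate = num / den
print(estimate)  # ≈ P(b | j, m) ≈ 0.284
```

Unlike rejection sampling, every sample contributes to the estimate; the price is that most samples carry very small weights.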
???
Cartoon: note how blue is now enforced and no longer drawn at random.
class: middle
???
Probability estimates are built by summing up weights instead of ones and normalizing.
class: middle
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle
The sampling probability for an event with likelihood weighting is
The weighted sampling probability is $$ \begin{aligned} S_\text{WS}(x,e) w(x,e) &= \prod_{i=1}^l P(x_i|\text{parents}(X_i)) \prod_{i=1}^m P(e_i|\text{parents}(E_i)) \\ &= P(x,e) \end{aligned} $$
class: middle
The estimated posterior probability is computed as follows:
Hence likelihood weighting returns consistent estimates.
class: middle
- Likelihood weighting is helpful:
- The evidence is taken into account to generate a sample.
- More samples will reflect the state of the world suggested by the evidence.
- Likelihood weighting does not solve all problems:
- Performance degrades as the number of evidence variables increases.
- The evidence influences the choice of downstream variables, but not upstream ones.
- Ideally, we would like to consider the evidence when we sample each and every variable.
- Markov chain Monte Carlo (MCMC) algorithms are a family of sampling algorithms that generate samples through a Markov chain.
- They generate a sequence of samples by making random changes to a preceding sample, instead of generating each sample from scratch.
- It is helpful to think of a Bayesian network as being in a particular current state, specifying a value for each variable, and of generating a next state by making random changes to the current state.
- Metropolis-Hastings is one of the most famous MCMC methods, of which Gibbs sampling is a special case.
class: middle
- Start with an arbitrary instance $x_1, ..., x_n$ consistent with the evidence.
- Sample one variable at a time, conditioned on all the rest, but keep the evidence fixed.
- Keep repeating this for a long time.
.center.width-100[![](figures/archives-lec-inference/gibbs-sampling.png)]
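A Gibbs-sampling sketch for $P(b|j,m)$ on the alarm network (standard CPT values assumed; the full conditionals below follow from the network structure). The evidence $J=j$, $M=m$ stays clamped throughout.

```python
import random

# Gibbs sampling for P(b | j, m) on the alarm network.
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def bernoulli(p_true, p_false):
    """Sample True with probability p_true / (p_true + p_false)."""
    return random.random() < p_true / (p_true + p_false)

b, e, a = False, False, True          # arbitrary initial state
counts = 0
N, burn_in = 200000, 1000
for t in range(N + burn_in):
    # Resample B given E, A (j, m do not involve B):
    #   P(B | e, a) ∝ P(B) P(a | B, e)
    pa_b1 = P_a[(True, e)] if a else 1 - P_a[(True, e)]
    pa_b0 = P_a[(False, e)] if a else 1 - P_a[(False, e)]
    b = bernoulli(0.001 * pa_b1, 0.999 * pa_b0)
    # Resample E given B, A:  P(E | b, a) ∝ P(E) P(a | b, E)
    pa_e1 = P_a[(b, True)] if a else 1 - P_a[(b, True)]
    pa_e0 = P_a[(b, False)] if a else 1 - P_a[(b, False)]
    e = bernoulli(0.002 * pa_e1, 0.998 * pa_e0)
    # Resample A given B, E and the evidence:
    #   P(A | b, e, j, m) ∝ P(A | b, e) P(j | A) P(m | A)
    a = bernoulli(P_a[(b, e)] * 0.90 * 0.70,
                  (1 - P_a[(b, e)]) * 0.05 * 0.01)
    if t >= burn_in:
        counts += b

estimate = counts / N
print(estimate)  # ≈ P(b | j, m) ≈ 0.284
```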
class: middle
- Both upstream and downstream variables condition on evidence.
- In contrast, likelihood weighting only conditions on upstream evidence, and hence the resulting weights might be very small.
class: middle
.grid[ .kol-1-4[
- Choose a non-evidence variable $X$.
- Resample $X$ from ${\bf P}(X|\text{all other variables})$.
class: middle
See code/lecture5-gibbs.ipynb.
class: middle
The sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability.
See Section 14.5.2 for a technical proof.
- Exact inference by variable elimination:
    - NP-complete on general graphs, but polynomial on polytrees.
    - space = time, very sensitive to topology.
- Approximate inference gives reasonable estimates of the true posterior probabilities in a network and can cope with much larger networks than can exact algorithms.
- Likelihood weighting does poorly when there is lots of evidence.
- Likelihood weighting and Gibbs sampling are generally insensitive to topology.
- Convergence can be slow with probabilities close to 1 or 0.
- Can handle arbitrary combinations of discrete and continuous variables.
class: end-slide, center count: false
The end.