class: middle, center, title-slide
Lecture: Inference in Bayesian networks
Prof. Gilles Louppe
[email protected]
???
Reorganize: use BNs as generators by sampling.
Discuss the sampling algorithms with code, to better communicate the inefficiency and intuition => All of this could be done with a notebook.
Skip the entire lecture?
.grid[ .kol-1-2[
- Exact inference
- Inference by enumeration
- Inference by variable elimination
- Approximate inference
.footnote[Image credits: CS188, UC Berkeley.]
.grid[ .kol-2-3[ A Bayesian network is a directed acyclic graph in which:
- Each node corresponds to a random variable $X_i$.
- Each node $X_i$ is annotated with a conditional probability distribution ${\bf P}(X_i | \text{parents}(X_i))$ that quantifies the effect of the parents on the node.
A Bayesian network implicitly encodes the full joint distribution as the product of the local distributions:
]
.footnote[Image credits: CS188, UC Berkeley.]
???
Reminder: the topology encodes conditional independence assumptions!
class: middle
.grid[
.kol-3-4[.center.width-100[]]
.kol-1-4[.center.width-100[
]]
]
$$ \begin{aligned} P(b,\lnot e, a, \lnot j, m) &= P(b)P(\lnot e)P(a|b, \lnot e)P(\lnot j|a)P(m|a) \\\\ &= 0.001 \times 0.998 \times 0.94 \times 0.1 \times 0.7 \end{aligned}$$
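As a sanity check, this product of CPT entries can be evaluated directly; a minimal sketch, assuming the values shown above:

```python
# Evaluating one full joint entry of the alarm network as a product
# of local CPT entries (values taken from the slide above).
p_b = 0.001      # P(b)
p_not_e = 0.998  # P(¬e)
p_a = 0.94       # P(a | b, ¬e)
p_not_j = 0.1    # P(¬j | a)
p_m = 0.7        # P(m | a)

joint = p_b * p_not_e * p_a * p_not_j * p_m
print(joint)  # ≈ 6.57e-05
```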
class: middle
Inference is concerned with the problem of computing a marginal and/or a conditional probability distribution from a joint probability distribution:
.grid[ .kol-1-3.center[Simple queries:] .kol-2-3[${\bf P}(X_i|e)$] ] .grid[ .kol-1-3.center[Conjunctive queries:] .kol-2-3[${\bf P}(X_i,X_j|e)={\bf P}(X_i|e){\bf P}(X_j|X_i,e)$] ] .grid[ .kol-1-3.center[Most likely explanation:] .kol-2-3[$\arg \max_q P(q|e)$] ] .grid[ .kol-1-3.center[Optimal decisions:] .kol-2-3[$\arg \max\_a \mathbb{E}_{p(s'|s,a)} \left[ V(s') \right]$] ]
.footnote[Image credits: CS188, UC Berkeley.]
???
Explain what
Insist on the importance of inference. Inference <=> reasoning.
Start from the joint distribution
- Select the entries consistent with the evidence $E_1, ..., E_k = e_1, ..., e_k$.
- Marginalize out the hidden variables to obtain the joint of the query and the evidence variables:
$${\bf P}(Q,e_1,...,e_k) = \sum_{h_1, ..., h_r} {\bf P}(Q, h_1, ..., h_r, e_1, ..., e_k).$$
- Normalize:
$$\begin{aligned} Z &= \sum_q P(q,e_1,...,e_k) \\\\ {\bf P}(Q|e_1, ..., e_k) &= \frac{1}{Z} {\bf P}(Q,e_1,...,e_k) \end{aligned}$$
class: middle
Consider the alarm network and the query
$\begin{aligned}
{\bf P}(B|j,m) &= \frac{1}{Z} \sum_e \sum_a {\bf P}(B,j,m,e,a) \\
&\propto \sum_e \sum_a {\bf P}(B,j,m,e,a)
\end{aligned}$
Using the Bayesian network, the full joint entries can be rewritten as the product of CPT entries:
$\begin{aligned}
{\bf P}(B|j,m) &\propto \sum_e \sum_a {\bf P}(B)P(e){\bf P}(a|B,e)P(j|a)P(m|a)
\end{aligned}$
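As an illustration, inference by enumeration for this query can be sketched in a few lines of Python. The CPT values below are the standard alarm-network ones and are an assumption here.

```python
# Inference by enumeration for the alarm-network query P(B | j, m).
# CPT values are the standard alarm-network ones (an assumption here).
P_b = {True: 0.001, False: 0.999}
P_e = {True: 0.002, False: 0.998}
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a=1 | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j=1 | A)
P_m = {True: 0.70, False: 0.01}                      # P(m=1 | A)

def joint(b, e, a, j, m):
    """Full joint entry as the product of local CPT entries."""
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    pj = P_j[a] if j else 1 - P_j[a]
    pm = P_m[a] if m else 1 - P_m[a]
    return P_b[b] * P_e[e] * pa * pj * pm

# Sum out the hidden variables E and A, then normalize over B.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e in (True, False) for a in (True, False))
          for b in (True, False)}
Z = sum(unnorm.values())
posterior = {b: p / Z for b, p in unnorm.items()}
print(posterior[True])  # ≈ 0.284
```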
???
&\propto P(B) \sum_e P(e) \sum_a P(a|B,e)P(j|a)P(m|a)
class: middle
Inference by enumeration is slow because the full joint distribution is built up before the hidden variables are summed out.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Notice that factors that do not depend on the variables in the summations can be factored out, which means that marginalization does not necessarily have to be done at the end:
class: middle
Same complexity as DFS:
???
- $n$ is the number of variables.
- $d$ is the size of their domain.
class: middle
Enumeration is still inefficient: there are repeated computations!
- e.g., $P(j|a)P(m|a)$ is computed twice, once for $e$ and once for $\lnot e$.
- These can be avoided by storing intermediate results.
???
Inefficient because the product is evaluated left-to-right, in a DFS manner.
The variable elimination (VE) algorithm carries out summations right-to-left and stores intermediate results (called factors) to avoid recomputations. The algorithm interleaves:
- Joining sub-tables
- Eliminating hidden variables
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
class: middle
- Each factor $\mathbf{f}_i$ is a multi-dimensional array indexed by the values of its argument variables. E.g.: .grid[ .kol-1-2[ $$ \begin{aligned} \mathbf{f}_4 &= \mathbf{f}_4(A) = \left(\begin{matrix} P(j|a) \\ P(j|\lnot a) \end{matrix}\right) = \left(\begin{matrix} 0.90 \\ 0.05 \end{matrix}\right) \\ \mathbf{f}_4(a) &= 0.90 \\ \mathbf{f}_4(\lnot a) &= 0.05 \end{aligned}$$ ] ]
- Factors are initialized with the CPTs annotating the nodes of the Bayesian network, conditioned on the evidence.
class: middle
The pointwise product
- Exactly like a database join!
- The variables of $\mathbf{f}_3$ are the union of the variables in $\mathbf{f}_1$ and $\mathbf{f}_2$.
- The elements of $\mathbf{f}_3$ are given by the product of the corresponding elements in $\mathbf{f}_1$ and $\mathbf{f}_2$.
class: middle
Summing out, or eliminating, a variable from a factor is done by adding up the sub-arrays formed by fixing the variable to each of its values in turn.
For example, to sum out
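The pointwise product and the sum-out operation can be sketched with numpy arrays; the axis convention (index 0 = true, index 1 = false) and the CPT values below are assumptions.

```python
import numpy as np

# Factors as numpy arrays indexed by their variables' values
# (assumed convention: index 0 = true, index 1 = false).
# f1(A, J) = P(J | A) and f2(A, M) = P(M | A), standard alarm values.
f1 = np.array([[0.90, 0.10],    # P(j | a),  P(¬j | a)
               [0.05, 0.95]])   # P(j | ¬a), P(¬j | ¬a)
f2 = np.array([[0.70, 0.30],    # P(m | a),  P(¬m | a)
               [0.01, 0.99]])   # P(m | ¬a), P(¬m | ¬a)

# Pointwise product: f3(A, J, M) = f1(A, J) * f2(A, M), via broadcasting.
f3 = f1[:, :, None] * f2[:, None, :]

# Summing out A: add the sub-arrays obtained by fixing A to each value.
f_jm = f3.sum(axis=0)
print(f_jm[0, 0])  # f(j, m) = 0.9*0.7 + 0.05*0.01 = 0.6305
```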
class: middle
Query:
- Start with the initial factors (the local CPTs, instantiated by the evidence).
- While there are still hidden variables:
    - Pick a hidden variable $H$.
    - Join all factors mentioning $H$.
    - Eliminate $H$.
- Join all remaining factors
- Normalize
exclude: true class: middle, center
(blackboard example)
???
Prepare this for
class: middle
Consider the query
- $\sum_m P(m|a) = 1$, therefore $M$ is irrelevant for the query.
- In other words, ${\bf P}(J|b)$ remains unchanged if we remove $M$ from the network.
.italic[Theorem.]
class: middle
Consider the query
Work through the two elimination orderings:
- $Z, X_1, ..., X_{n-1}$
- $X_1, ..., X_{n-1}, Z$

What is the size of the maximum factor generated for each of the orderings?
- Answer: $2^{n+1}$ vs. $2^2$ (assuming boolean values).
class: middle
The computational and space complexity of variable elimination is determined by the largest factor.
- The elimination ordering can greatly affect the size of the largest factor.
- Does there always exist an ordering that only results in small factors? No!
- Greedy heuristic: eliminate whichever variable minimizes the size of the factor to be constructed.
- Singly connected networks (polytrees):
    - Any two nodes are connected by at most one (undirected) path.
    - For these networks, the time and space complexity of variable elimination are $O(nd^k)$.
class: middle
3SAT is a special case of inference:
- CSP: $(u_1 \lor u_2 \lor u_3) \wedge (\lnot u_1 \lor \lnot u_2 \lor u_3) \wedge (u_2 \lor \lnot u_3 \lor u_4)$
- $P(U_i=0)=P(U_i=1)=0.5$
- $C_1 = U_1 \lor U_2 \lor U_3$; $C_2 = \lnot U_1 \lor \lnot U_2 \lor U_3$; $C_3 = U_2 \lor \lnot U_3 \lor U_4$
- $D_1 = C_1$; $D_2 = D_1 \wedge C_2$; $Y = D_2 \wedge C_3$
???
SKIP since Lecture 2b is optional.
3SAT is the problem of determining the satisfiability of a formula in conjunctive normal form, where each clause is limited to at most three literals.
class: middle
If we can answer whether
- By reduction, inference in Bayesian networks is therefore NP-complete.
- There is no known efficient probabilistic inference algorithm in general.
???
Proof by reduction: transforming a problem (here inference) into another (here checking the satisfiability of a formula).
class: middle
class: middle
Exact inference is intractable for most probabilistic models of practical interest (e.g., models involving many variables, both continuous and discrete, or undirected cycles).
.center.width-80[]
.center.width-100[
]
class: middle
Abandon exact inference and develop approximate but faster inference algorithms:
- Sampling methods: produce answers by repeatedly generating random numbers from a distribution of interest.
- Variational methods: formulate inference as an optimization problem.
- Belief propagation methods: formulate inference as a message-passing algorithm.
- Machine learning methods: learn an approximation of the target distribution from training examples.
Basic idea:
- Draw $N$ samples from a sampling distribution $S$.
- Compute an approximate posterior probability $\hat{P}$.
- Show that this estimate converges to the true probability distribution $P$.
Generating samples is often much faster than computing the right answer (e.g., with variable elimination).
.footnote[Image credits: CS188, UC Berkeley.]
How to sample from the distribution of a discrete variable?
- Assume $k$ discrete outcomes $x_1, ..., x_k$ with probability $P(x_i)$.
- Assume sampling from the uniform $\mathcal{U}(0,1)$ is possible.
    - e.g., as enabled by a standard rand() function.
- Divide the $[0,1]$ interval into $k$ regions, with region $i$ having size $P(x_i)$.
- Sample $u \sim \mathcal{U}(0,1)$ and return the value associated to the region in which $u$ falls.
.center.width-60[![](figures/archives-lec-inference/sampling.png)]
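This procedure can be sketched in Python; the helper name `sample_discrete` is hypothetical.

```python
import random

# Divide [0, 1] into regions of size P(x_i) and return the outcome
# whose region contains u ~ U(0, 1).
def sample_discrete(outcomes, probs):
    u = random.random()        # u ~ U(0, 1)
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if u < cumulative:
            return x
    return outcomes[-1]        # guard against floating-point round-off

# e.g., C in {red, green, blue} with P = (0.6, 0.1, 0.3)
samples = [sample_discrete(["red", "green", "blue"], [0.6, 0.1, 0.3])
           for _ in range(10000)]
print(samples.count("red") / len(samples))  # ≈ 0.6
```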
class: middle
.grid[
.kol-1-3.center[
]
.kol-2-3[
$$\begin{aligned}
0 \leq u < 0.6 &\to C = \text{red} \\
0.6 \leq u < 0.7 &\to C = \text{green} \\
0.7 \leq u < 1 &\to C = \text{blue} \\
\end{aligned}$$
]
]
.footnote[Image credits: CS188, UC Berkeley.]
Sampling from a Bayesian network, without observed evidence:
- Sample each variable in turn, in topological order.
- The probability distribution from which the value is sampled is conditioned on the values already assigned to the variable's parents.
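A minimal prior-sampling sketch for the alarm network; the CPT values below are the standard ones and are an assumption here.

```python
import random

# Prior (ancestral) sampling for the alarm network.
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def prior_sample():
    # Sample each variable in topological order, conditioned on its parents.
    b = random.random() < 0.001                  # P(b)
    e = random.random() < 0.002                  # P(e)
    a = random.random() < P_a[(b, e)]            # P(a | b, e)
    j = random.random() < (0.90 if a else 0.05)  # P(j | a)
    m = random.random() < (0.70 if a else 0.01)  # P(m | a)
    return b, e, a, j, m

# Estimate P(j) by simple counting over N samples.
N = 100000
estimate = sum(prior_sample()[3] for _ in range(N)) / N
print(estimate)  # ≈ 0.052
```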
class: middle
.footnote[Image credits: CS188, UC Berkeley.]
???
Comment on the cartoon:
- sample first a shape
- then a color given the shape
class: middle
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle
We will collect a bunch of samples from the Bayesian network:
If we want to know ${\bf P}(W)$:
- We have counts $\langle w:4, \lnot w:1 \rangle$.
- Normalize to obtain $\hat{{\bf P}}(W) = \langle w:0.8, \lnot w:0.2 \rangle$.
- $\hat{{\bf P}}(W)$ will get closer to the true distribution ${\bf P}(W)$ as we generate more samples.
class: middle
The probability that prior sampling generates a particular event is
Let
class: middle
Then,
$$\begin{aligned}
\lim_{N \to \infty} \hat{P}(x_1,...,x_n) &= \lim_{N \to \infty} N_\text{PS}(x_1, ..., x_n) / N \\
&= S_\text{PS}(x_1, ..., x_n) \\
&= P(x_1, ..., x_n)
\end{aligned}$$
Therefore, prior sampling is consistent:
Using prior sampling, an estimate
.footnote[Image credits: CS188, UC Berkeley.]
???
Cartoon:
- reject all samples which are not blue (the evidence)
class: middle
class: middle
Let us consider the posterior probability estimator
Therefore, rejection sampling is consistent.
- The standard deviation of the error in each probability is $O(1/\sqrt{n})$, where $n$ is the number of samples used to compute the estimate.
- Problem: many samples are rejected!
    - Hopelessly expensive if the evidence is unlikely, i.e. if $P(e)$ is small.
    - Evidence is not exploited when sampling.
???
From the prior sampling analysis, we know that N_PS / N ---> P.
Idea: clamp the evidence variables, sample the rest.
- Problem: the resulting sampling distribution is not consistent.
- Solution: weight by probability of evidence given parents.
.footnote[Image credits: CS188, UC Berkeley.]
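The idea can be sketched for the query $P(b|j,m)$ on the alarm network (standard CPT values assumed): evidence variables are clamped, and each sample is weighted by the probability of the evidence given its sampled parents.

```python
import random

# Likelihood weighting for P(b | j, m) on the alarm network.
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def weighted_sample():
    w = 1.0
    b = random.random() < 0.001
    e = random.random() < 0.002
    a = random.random() < P_a[(b, e)]
    w *= 0.90 if a else 0.05   # evidence J = j is clamped: weight by P(j | a)
    w *= 0.70 if a else 0.01   # evidence M = m is clamped: weight by P(m | a)
    return b, w

num, den = 0.0, 0.0
for _ in range(1000000):
    b, w = weighted_sample()
    den += w
    if b:
        num += w

estimate = num / den
print(estimate)  # ≈ P(b | j, m) ≈ 0.284
```

Unlike rejection sampling, every sample contributes to the estimate; the price is that most samples carry very small weights.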
???
Cartoon: note how blue is now enforced and no longer drawn at random.
class: middle
???
Probability estimates are built by summing up weights instead of ones and normalizing.
class: middle
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle
The sampling probability for an event with likelihood weighting is
The weighted sampling probability is $$ \begin{aligned} S_\text{WS}(x,e) w(x,e) &= \prod_{i=1}^l P(x_i|\text{parents}(X_i)) \prod_{i=1}^m P(e_i|\text{parents}(E_i)) \\ &= P(x,e) \end{aligned} $$
class: middle
The estimated posterior probability is computed as follows:
Hence likelihood weighting returns consistent estimates.
class: middle
- Likelihood weighting is helpful:
- The evidence is taken into account to generate a sample.
- More samples will reflect the state of the world suggested by the evidence.
- Likelihood weighting does not solve all problems:
- Performance degrades as the number of evidence variables increases.
- The evidence influences the choice of downstream variables, but not upstream ones.
- Ideally, we would like to consider the evidence when we sample each and every variable.
- Markov chain Monte Carlo (MCMC) algorithms are a family of sampling algorithms that generate samples through a Markov chain.
- They generate a sequence of samples by making random changes to a preceding sample, instead of generating each sample from scratch.
- It is helpful to think of a Bayesian network as being in a particular current state, specifying a value for each variable, and of generating a next state by making random changes to the current state.
- Metropolis-Hastings is one of the most famous MCMC methods, of which Gibbs sampling is a special case.
class: middle
- Start with an arbitrary instance $x_1, ..., x_n$ consistent with the evidence.
- Sample one variable at a time, conditioned on all the rest, but keep the evidence fixed.
- Keep repeating this for a long time.
.center.width-100[![](figures/archives-lec-inference/gibbs-sampling.png)]
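A Gibbs-sampling sketch for $P(b|j,m)$ on the alarm network (standard CPT values assumed; the full conditionals below follow from the network structure). The evidence $J=j$, $M=m$ stays clamped throughout.

```python
import random

# Gibbs sampling for P(b | j, m) on the alarm network.
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def bernoulli(p_true, p_false):
    """Sample True with probability p_true / (p_true + p_false)."""
    return random.random() < p_true / (p_true + p_false)

b, e, a = False, False, True          # arbitrary initial state
counts = 0
N, burn_in = 200000, 1000
for t in range(N + burn_in):
    # Resample B given E, A (j, m do not involve B):
    #   P(B | e, a) ∝ P(B) P(a | B, e)
    pa_b1 = P_a[(True, e)] if a else 1 - P_a[(True, e)]
    pa_b0 = P_a[(False, e)] if a else 1 - P_a[(False, e)]
    b = bernoulli(0.001 * pa_b1, 0.999 * pa_b0)
    # Resample E given B, A:  P(E | b, a) ∝ P(E) P(a | b, E)
    pa_e1 = P_a[(b, True)] if a else 1 - P_a[(b, True)]
    pa_e0 = P_a[(b, False)] if a else 1 - P_a[(b, False)]
    e = bernoulli(0.002 * pa_e1, 0.998 * pa_e0)
    # Resample A given B, E and the evidence:
    #   P(A | b, e, j, m) ∝ P(A | b, e) P(j | A) P(m | A)
    a = bernoulli(P_a[(b, e)] * 0.90 * 0.70,
                  (1 - P_a[(b, e)]) * 0.05 * 0.01)
    if t >= burn_in:
        counts += b

estimate = counts / N
print(estimate)  # ≈ P(b | j, m) ≈ 0.284
```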
class: middle
- Both upstream and downstream variables condition on evidence.
- In contrast, likelihood weighting only conditions on upstream evidence, and hence the resulting weights might be very small.
class: middle
.grid[ .kol-1-4[
- Choose a non-evidence variable $X$.
- Resample $X$ from ${\bf P}(X|\text{all other variables})$.
class: middle
See code/lecture5-gibbs.ipynb.
class: middle
The sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability.
See Section 14.5.2 for a technical proof.
- Exact inference by variable elimination:
    - NP-complete on general graphs, but polynomial on polytrees.
    - space = time, very sensitive to topology.
- Approximate inference gives reasonable estimates of the true posterior probabilities in a network and can cope with much larger networks than can exact algorithms.
- Likelihood weighting does poorly when there is lots of evidence.
- Likelihood weighting and Gibbs sampling are generally insensitive to topology.
- Convergence can be slow with probabilities close to 1 or 0.
- Can handle arbitrary combinations of discrete and continuous variables.
class: end-slide, center count: false
The end.