exact inference
kuleshov committed Mar 16, 2016
1 parent 0f9eb80 commit 7c3cb98
Showing 13 changed files with 359 additions and 35 deletions.
12 changes: 1 addition & 11 deletions _includes/header.html
@@ -1,18 +1,8 @@
<!--- Header and nav template site-wide -->
<header>
<nav class="group">
<!-- <a href="{{site.baseurl}}/"><img class="badge" src="{{site.baseurl}}/assets/img/badge_1.png" alt="CH"></a> -->
<!-- {% for node in site.pages %} -->
<!-- {% unless node.nav_exclude %} -->
<!-- {% if page.url == node.url %} -->
<!-- <a class="active" href="{{node.url | prepend: site.baseurl}}" class="active">{{node.title}}</a> -->
<!-- {% else %} -->
<!-- <a href="{{node.url | prepend: site.baseurl}}">{{node.title}}</a> -->
<!-- {% endif %} -->
<!-- {% endunless %} -->
<!-- {% endfor %} -->
<a href="{{ site.baseurl }}/">Contents</a>
<a href="http://cs.stanford.edu/~ermon/cs228/index.html">Class</a>
<a href="#">Github</a>
<a href="http://github.com/kuleshov/cs228-notes">Github</a>
</nav>
</header>
2 changes: 2 additions & 0 deletions _layouts/post.html
@@ -22,6 +22,8 @@ <h1>{{ page.title | capitalize }}</h1>
Ind: "{\\mathbb{I}}",
KL: "{\\mathbb{KL}}",
Dc: "{\\mathcal{D}}",
Tc: "{\\mathcal{T}}",
Xc: "{\\mathcal{X}}",
note: ["\\textcolor{blue}{[NOTE: #1]}",1]
}
}
Binary file added assets/img/badjunctiontree.png
Binary file added assets/img/jt-over-tree.png
Binary file added assets/img/junctionpath.png
Binary file added assets/img/junctiontree.png
Binary file added assets/img/marginalization.png
Binary file added assets/img/mp1.png
8 changes: 4 additions & 4 deletions index.md
@@ -3,10 +3,10 @@ layout: post
title: Contents
---
{% newthought 'These notes'%} form a concise introductory course on probabilistic graphical modeling{% sidenote 1 'Probabilistic graphical modeling is a subfield of AI that studies how to model the world with probability distributions.'%}.
They accompany and are based on the material of [CS228](cs.stanford.edu/~ermon/cs228/index.html), the graphical models course at Stanford University, taught by [Stefano Ermon](cs.stanford.edu/~ermon/).
They accompany and are based on the material of [CS228](http://cs.stanford.edu/~ermon/cs228/index.html), the graphical models course at Stanford University, taught by [Stefano Ermon](http://cs.stanford.edu/~ermon/).

The notes are written and maintained by [Volodymyr Kuleshov](www.stanford.edu/~kuleshov); contact me with any feedback and feel free to contribute your improvements on [Github](https://github.com/kuleshov/cs228-notes).
This site is currently under construction, but come back soon as we get more material online.
In the meantime, contact [Volodymyr Kuleshov](http://www.stanford.edu/~kuleshov) with any feedback and feel free to contribute your improvements on [Github](https://github.com/kuleshov/cs228-notes).

## Preliminaries

@@ -24,9 +24,9 @@ This site is currently under construction, but come back soon as we get more mat

## Inference

1. [Variable elimination](#): The inference problem. Variable elimination. Complexity of inference.
1. [Variable elimination](inference/ve/): The inference problem. Variable elimination. Complexity of inference.

2. Belief propagation: The junction tree algorithm. Exact inference in arbitrary graphs.
2. [Belief propagation](inference/jt/): The junction tree algorithm. Exact inference in arbitrary graphs.

3. Sampling-based inference: Monte-Carlo sampling. Importance sampling. Markov Chain Monte-Carlo. Applications in inference.

189 changes: 189 additions & 0 deletions inference/jt/index.md

Large diffs are not rendered by default.

147 changes: 147 additions & 0 deletions inference/ve/index.md
@@ -0,0 +1,147 @@
---
layout: post
title: Variable Elimination
---
Next, we turn our attention to the problem of *inference* in graphical models.
Given a probabilistic model (such as a Bayes net or an MRF), we are interested in using it to answer useful questions, e.g. determining the probability that a given email is spam. More formally, we will be focusing on two types of questions:

- *Marginal inference*: what is the probability of a given variable in our model after we sum everything else out (e.g. probability of spam vs non-spam)?
{% math %}
p(y=1) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} p(y=1,x_1, x_2, ..., x_n).
{% endmath %}
- *Maximum a posteriori (MAP) inference*: what is the most likely assignment to the variables in the model (possibly conditioned on evidence)?
{% math %}
\max_{x_1, \dots, x_n} p(y=1, x_1,...,x_n)
{% endmath %}

It turns out that inference is a challenging task. For many distributions of interest, answering either of these questions is NP-hard. Crucially, whether inference is tractable depends on the structure of the graph that describes the distribution. If a problem is intractable, we will still be able to obtain useful answers via approximate inference methods.

This chapter covers the first exact inference algorithm, *variable elimination*. We will discuss approximate inference in later chapters.


## An illustrative example

Consider first the problem of marginal inference. Suppose for simplicity that we are given a chain Bayesian network, i.e., a probability distribution of the form
{% math %}
p(x_1,...,x_n) = p(x_1) \prod_{i=2}^n p(x_i \mid x_{i-1}).
{% endmath %}
We are interested in computing the marginal probability $$p(x_n)$$. We will assume for the rest of the chapter that the $$x_i$$ are discrete variables taking $$d$$ possible values each{% sidenote 1 'The principles behind variable elimination also extend to many continuous distributions (e.g. Gaussians), but we will not discuss these extensions here.'%}.

The naive way of computing this marginal is to sum the probability over all $$d^{n-1}$$ assignments to $$x_1,...,x_{n-1}$$:
{% math %}
p(x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} p(x_1,...,x_n).
{% endmath %}

However, we can do much better by leveraging the factorization of our probability distribution. We may rewrite the sum in a way that "pushes in" certain sums deeper into the product:
{% math %}
\begin{align*}
p(x_n)
& = \sum_{x_1} \cdots \sum_{x_{n-1}} p(x_1) \prod_{i=2}^n p(x_i \mid x_{i-1}) \\
& = \sum_{x_{n-1}} p(x_n \mid x_{n-1}) \sum_{x_{n-2}} p(x_{n-1} \mid x_{n-2}) \cdots \sum_{x_1} p(x_2 \mid x_1) p(x_1) .
\end{align*}
{% endmath %}
We perform this summation by summing the inner terms first, starting from $$x_1$$ and ending with $$x_{n-1}$$. More concretely, we start by computing an intermediate *factor* $$\tau(x_2) = \sum_{x_1} p(x_2 \mid x_1) p(x_1)$$ by summing out $$x_1$$. This takes $$O(d^2)$$ time because we must sum over $$x_1$$ for each assignment to $$x_2$$. The resulting factor $$\tau(x_2)$$ can be thought of as a table of values (though not necessarily probabilities), with one entry for each assignment to $$x_2$$ (just as the factor $$p(x_1)$$ can be represented as a table). We may then rewrite the marginal probability using $$\tau$$ as
{% math %}
p(x_n) = \sum_{x_{n-1}} p(x_n \mid x_{n-1}) \sum_{x_{n-2}} p(x_{n-1} \mid x_{n-2}) \cdots \sum_{x_2} p(x_3 \mid x_2) \tau(x_2).
{% endmath %}

Note that this has the same form as the initial expression, except that we are summing over one fewer variable{% sidenote 2 'This technique is a special case of *dynamic programming*, a general algorithm design approach in which we break apart a larger problem into a sequence of smaller ones.'%}. We may therefore compute another factor $$\tau(x_3) = \sum_{x_2} p(x_3 \mid x_2) \tau(x_2)$$, and repeat the process until we are only left with $$x_n$$. Since each step takes $$O(d^2)$$ time, and we perform $$O(n)$$ steps, inference now takes $$O(n d^2)$$ time, which is much better than our naive $$O(d^n)$$ solution.

Note that at each step we are *eliminating* a variable, which is what gives the algorithm its name.
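
To make this concrete, here is a minimal Python/NumPy sketch of the chain computation: the $$O(n d^2)$$ elimination pass, checked against the naive $$O(d^n)$$ enumeration on a small chain. The CPT values (and the helper names) are made up purely for illustration.

```python
import numpy as np
from itertools import product

# Illustrative chain Bayes net: n variables with d states each, random CPTs.
rng = np.random.default_rng(0)
n, d = 8, 3

p_x1 = rng.random(d)
p_x1 /= p_x1.sum()                        # p(x_1)
cpts = rng.random((n - 1, d, d))
cpts /= cpts.sum(axis=1, keepdims=True)   # cpts[i][a, b] = p(x_{i+2} = a | x_{i+1} = b)

# Variable elimination: tau(x_{i+1}) = sum_{x_i} p(x_{i+1} | x_i) tau(x_i), O(d^2) per step.
tau = p_x1
for i in range(n - 1):
    tau = cpts[i] @ tau
print("p(x_n) via elimination:", tau)

# Naive check by full enumeration, O(d^n) -- only feasible because n is tiny here.
brute = np.zeros(d)
for assignment in product(range(d), repeat=n):
    p = p_x1[assignment[0]]
    for i in range(n - 1):
        p *= cpts[i][assignment[i + 1], assignment[i]]
    brute[assignment[-1]] += p
assert np.allclose(brute, tau)
```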

## Eliminating Variables

Having built up intuition with this special case, we now introduce the variable elimination algorithm in its most general form.

### Factors

We will assume that we are given a graphical model as a product of factors
{% math %}
p(x_1,...,x_n) = \prod_{c \in C} \phi_c(x_c).
{% endmath %}

Recall that we can view a factor as a multi-dimensional table assigning a value to each assignment of a set of variables $$x_c$$. In the context of a Bayesian network, the factors correspond to conditional probability distributions; however, this definition also makes our algorithm equally applicable to Markov Random Fields. In this latter case, the factors encode an unnormalized distribution; to compute marginals, we first calculate the partition function (also using variable elimination), then we compute marginals using the unnormalized distribution, and finally we divide the result by the partition function to obtain a valid marginal probability.

### Factor Operations

The variable elimination algorithm will repeatedly perform two factor operations: product and marginalization. We have already been implicitly performing these operations in our chain example.

The factor product operation simply defines the product $$\phi_3 := \phi_1 \times \phi_2$$ of two factors $$\phi_1, \phi_2$$ as
{% math %}
\phi_3(x_c) = \phi_1(x_c^{(1)}) \times \phi_2(x_c^{(2)}).
{% endmath %}
The scope of $$\phi_3$$ is defined as the union of the variables in the scopes of $$\phi_1, \phi_2$$; also $$x_c^{(i)}$$ denotes an assignment to the variables in the scope of $$\phi_i$$ defined by the restriction of $$x_c$$ to that scope. For example, we define $$\phi_3(a,b,c) := \phi_1(a,b) \times \phi_2(b,c)$$.

Next, the marginalization operation "locally" eliminates a set of variables from a factor. If we have a factor $$\phi(X,Y)$$ over two sets of variables $$X,Y$$, marginalizing $$Y$$ produces a new factor
{% math %}
\tau(x) = \sum_{y} \phi(x, y),
{% endmath %}
where the sum is over all joint assignments to the set of variables $$Y$$.{% marginfigure 'marg' 'assets/img/marginalization.png' 'Here, we are marginalizing out variable $$B$$ from factor $$\phi(A,B,C).$$'%}

We use $$\tau$$ to refer to the marginalized factor. It is important to understand that this factor does not necessarily correspond to a probability distribution, even if $$\phi$$ was a CPD.
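
As a concrete illustration, the following Python sketch represents a factor explicitly as a table indexed by joint assignments and implements the two operations above. The `Factor` type, the `domains` dictionary, and the variable names are illustrative assumptions, not a standard API.

```python
import itertools
from collections import namedtuple

# A factor is a scope (tuple of variable names) plus a table mapping each joint
# assignment of the scope (a tuple of values) to a number.
Factor = namedtuple("Factor", ["scope", "table"])

def factor_product(f1, f2, domains):
    """phi_3(x_c) := phi_1(x_c^(1)) * phi_2(x_c^(2)); the scope is the union of the two scopes."""
    scope = tuple(dict.fromkeys(f1.scope + f2.scope))   # union, preserving order
    table = {}
    for vals in itertools.product(*(range(domains[v]) for v in scope)):
        assignment = dict(zip(scope, vals))
        v1 = f1.table[tuple(assignment[v] for v in f1.scope)]
        v2 = f2.table[tuple(assignment[v] for v in f2.scope)]
        table[vals] = v1 * v2
    return Factor(scope, table)

def marginalize(f, var):
    """tau(x) = sum_y phi(x, y): sum out `var` from the factor."""
    idx = f.scope.index(var)
    new_scope = f.scope[:idx] + f.scope[idx + 1:]
    table = {}
    for vals, value in f.table.items():
        key = vals[:idx] + vals[idx + 1:]
        table[key] = table.get(key, 0.0) + value
    return Factor(new_scope, table)

# Example: phi_1(a, b) * phi_2(b, c), then sum out b.
domains = {"a": 2, "b": 2, "c": 2}
phi1 = Factor(("a", "b"), {(x, y): 0.25 for x in range(2) for y in range(2)})
phi2 = Factor(("b", "c"), {(x, y): 0.5 for x in range(2) for y in range(2)})
tau = marginalize(factor_product(phi1, phi2, domains), "b")   # a factor over (a, c)
```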

### Orderings

Finally, the variable elimination algorithm requires an ordering over the variables according to which variables will be "eliminated". In our chain example, we took the ordering implied by the DAG. It is important to note that:

- Different orderings can dramatically alter the running time of the variable elimination algorithm.
- It is NP-hard to find the best ordering.

We will come back to these complications later, but for now let the ordering be fixed.

### The variable elimination algorithm

We are now ready to formally define the variable elimination (VE) algorithm.
Essentially, we loop over the variables as ordered by $$O$$ and eliminate them in that ordering. Intuitively, this corresponds to choosing a sum and "pushing it in" as far as possible inside the product of the factors, as we did in the chain example.

More formally, for each variable $$X_i$$ (ordered according to $$O$$),

1. Multiply all factors $$\Phi_i$$ containing $$X_i$$
2. Marginalize out $$X_i$$ to obtain new factor $$\tau$$
3. Replace the factors in $$\Phi_i$$ by $$\tau$$
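
Under the same assumptions as before, a sketch of the VE loop itself might look as follows; it reuses the illustrative `factor_product` and `marginalize` helpers from the previous snippet.

```python
from functools import reduce

def variable_elimination(factors, ordering, domains):
    """`factors` is a list of Factor objects, `ordering` is the elimination order O,
    and `domains` maps variable names to cardinalities."""
    factors = list(factors)
    for var in ordering:
        involved = [f for f in factors if var in f.scope]                 # step 1: multiply factors containing X_i
        product = reduce(lambda a, b: factor_product(a, b, domains), involved)
        tau = marginalize(product, var)                                   # step 2: marginalize out X_i
        factors = [f for f in factors if var not in f.scope] + [tau]      # step 3: replace them by tau
    # Multiply whatever remains: a single factor over the non-eliminated variables.
    return reduce(lambda a, b: factor_product(a, b, domains), factors)
```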

### Examples

Let's try to understand what these steps correspond to in our chain example. In that case, the chosen ordering was $$x_1, x_2, ..., x_{n-1}.$$
Starting with $$x_1$$, we collected all the factors involving $$x_1$$, which were $$p(x_1)$$ and $$p(x_2 \mid x_1)$$. We then used them to construct a new factor $$\tau(x_2) = \sum_{x_1} p(x_2 \mid x_1) p(x_1)$$. This can be seen as the result of steps 1 and 2 of the VE algorithm: first we form a large factor $$\sigma(x_2, x_1) = p(x_2 \mid x_1) p(x_1)$$; then we eliminate $$x_1$$ from that factor to produce $$\tau$$. We then repeat the same procedure for $$x_2$$, except that the factors involved are now $$p(x_3 \mid x_2), \tau(x_2)$$.

For a slightly more complex example, recall the graphical model of a student's grade that we introduced earlier.{% marginfigure 'nb1' 'assets/img/grade-model.png' "Bayes net model of a student's grade $$g$$ on an exam; in addition to $$g$$, we also model other aspects of the problem, such as the exam's difficulty $$d$$, the student's intelligence $$i$$, his SAT score $$s$$, and the quality $$l$$ of a reference letter from the professor who taught the course. Each variable is binary, except for $$g$$, which takes 3 possible values."%}
The probability specified by the model is of the form
{% math %}
p(l, g, i, d, s) = p(l \mid g) p(g \mid i, d) p(i) p(d) p(s\mid i).
{% endmath %}
Let's suppose that we are computing $$p(l)$$ and are eliminating variables in their topological ordering in the graph. First, we eliminate $$d$$, which corresponds to creating a new factor $$\tau_1(g,i) = \sum_{d} p(g \mid i, d)p(d)$$. Next, we eliminate $$i$$ to produce a factor $$\tau_2(g,s) = \sum_{i} \tau_1(g,i) p(i) p(s \mid i)$$; then we eliminate $$s$$, yielding $$\tau_3(g) = \sum_{s} \tau_2(g,s)$$, and so forth. Note that these operations are equivalent to summing out the factored probability distribution as follows:
{% math %}
p(l) = \sum_{g} p(l \mid g) \sum_{s} \sum_{i} p(s\mid i) p(i) \sum_{d} p(g \mid i, d) p(d) .
{% endmath %}
Note that this example requires computing at most $$d^3$$ operations per elimination step, since each factor involves at most two variables and one variable is summed out at each step (here $$d$$ denotes the number of values a variable can take, which is either 2 or 3 in this example).
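
For this example, the same computation can be written compactly with NumPy's `einsum`, following the elimination ordering $$d, i, s, g$$ used above. The CPT values below are randomly generated placeholders; only their shapes and the conditional structure matter.

```python
import numpy as np

# Illustrative CPTs for the grade model; axis order follows the variable names.
rng = np.random.default_rng(0)
def random_cpt(shape):
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)   # normalize over the child variable (axis 0)

p_d = random_cpt((2,))           # p(d)
p_i = random_cpt((2,))           # p(i)
p_g_id = random_cpt((3, 2, 2))   # p(g | i, d), axes (g, i, d)
p_s_i = random_cpt((2, 2))       # p(s | i), axes (s, i)
p_l_g = random_cpt((2, 3))       # p(l | g), axes (l, g)

# Eliminate d, i, s, g in turn, mirroring the ordering in the text.
tau1 = np.einsum('gid,d->gi', p_g_id, p_d)           # tau_1(g, i) = sum_d p(g|i,d) p(d)
tau2 = np.einsum('gi,i,si->gs', tau1, p_i, p_s_i)    # tau_2(g, s) = sum_i tau_1(g,i) p(i) p(s|i)
tau3 = tau2.sum(axis=1)                              # tau_3(g)   = sum_s tau_2(g, s)
p_l = p_l_g @ tau3                                   # p(l)       = sum_g p(l|g) tau_3(g)
print(p_l, p_l.sum())                                # a valid distribution: sums to 1
```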

## Introducing evidence

A closely related and equally important problem is computing conditional probabilities of the form
{% math %}
P(Y \mid E = e) = \frac{P(Y, E = e)}{P(E=e)},
{% endmath %}
where $$P(X,Y,E)$$ is a probability distribution over sets of query variables $$Y$$, observed evidence variables $$E$$, and unobserved variables $$X$$.

We can compute this probability by performing variable elimination once on $$P(Y, E = e)$$ and then once more on $$P(E = e)$$.

To compute $$P(Y, E = e)$$, we simply take every factor $$\phi(X', Y', E')$$ whose scope contains evidence variables $$E' \subseteq E$$, and set those variables to the values specified by $$e$$. Then we perform standard variable elimination over $$X$$ to obtain a factor over only $$Y$$.
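
In code, introducing evidence amounts to restricting each factor to the rows consistent with $$e$$ before running variable elimination; a sketch, again reusing the illustrative `Factor` representation from above, might look as follows.

```python
def reduce_evidence(f, evidence):
    """Restrict a factor to assignments consistent with `evidence` (a dict var -> value),
    dropping the observed variables from its scope. Reuses the illustrative Factor type."""
    observed = {i for i, v in enumerate(f.scope) if v in evidence}
    new_scope = tuple(v for v in f.scope if v not in evidence)
    table = {}
    for vals, value in f.table.items():
        if all(vals[i] == evidence[f.scope[i]] for i in observed):
            table[tuple(v for i, v in enumerate(vals) if i not in observed)] = value
    return Factor(new_scope, table)

# To compute P(Y, E = e): reduce every factor by the evidence, then run
# variable_elimination over the unobserved variables X.
```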

## Running Time of Variable Elimination

It is very important to understand that the running time of Variable Elimination depends heavily on the structure of the graph.

In the previous example, suppose we had eliminated $$g$$ first. Then we would have had to combine the factors $$p(g \mid i, d)$$ and $$p(l \mid g)$$ into a big factor $$\tau(d, i, l)$$ over 3 variables, which would require $$O(d^4)$$ time to compute. If the graph also contained an edge $$S \rightarrow G$$ (so that the corresponding factor were $$p(g \mid i, d, s)$$), eliminating $$g$$ first would produce a single giant factor $$\tau(d, i, l, s)$$ over 4 variables in $$O(d^5)$$ time. Eliminating any variable from this factor would then require almost as much work as if we had started with the original distribution, since all the variables have become coupled.

Clearly, some orderings are much more efficient than others. In fact, the running time of variable elimination is $$O(m d^M)$$, where $$m$$ is the number of elimination steps (at most the number of variables), $$d$$ is the number of values each variable can take, and $$M$$ is the maximum number of variables in any factor formed during the elimination process.

### Choosing variable elimination orderings

Unfortunately, choosing the optimal VE ordering is an NP-hard problem. However, in practice, we may resort to the following heuristics:

- *Min-neighbors*: Choose a variable with the fewest dependent variables.
- *Min-weight*: Choose the variable that minimizes the product of the cardinalities of its dependent variables.
- *Min-fill*: Choose the vertex that minimizes the size of the factor that will be added to the graph.

In practice, these heuristics often result in reasonably good orderings in many interesting settings.
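
To illustrate, here is a sketch of a greedy min-neighbors ordering in Python, where the graph is given as an adjacency map connecting variables that share a factor; min-weight and min-fill follow the same greedy pattern with a different scoring function. The function name and graph encoding are assumptions made for this example.

```python
def greedy_min_neighbors_ordering(graph):
    """`graph` maps each variable to the set of variables it shares a factor with.
    Repeatedly eliminate the variable with the fewest neighbors, adding fill-in
    edges among its neighbors, just as elimination itself would."""
    graph = {v: set(nbrs) for v, nbrs in graph.items()}
    ordering = []
    while graph:
        var = min(graph, key=lambda v: len(graph[v]))   # min-neighbors score
        nbrs = graph.pop(var)
        for u in nbrs:
            graph[u] |= nbrs - {u}    # connect var's neighbors to each other
            graph[u].discard(var)     # and forget the eliminated variable
        ordering.append(var)
    return ordering

# Example: on a chain a - b - c - d the heuristic always eliminates an endpoint
# of the remaining chain, so no intermediate factor involves more than two variables.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(greedy_min_neighbors_ordering(chain))
```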
2 changes: 1 addition & 1 deletion preliminaries/introduction/index.md
Expand Up @@ -4,7 +4,7 @@ title: Introduction
---
Probabilistic graphical modeling is a field of AI that studies how to model real-world phenomena using probability distributions and use these models to make useful predictions about the future.

Building probabilistic models turns out to be a complex and fascinating problem. From a more academic point of view, this field builds on a beautiful theory that bridges two very different fields of mathematics: statistics --- which forms the core of modern machine learning and data analysis --- as well as discrete math --- particularly graph theory and combinatorics. The field also has intriguing connections to philosophy, especially the question of causality.
Building probabilistic models turns out to be a complex and fascinating problem. From a more academic point of view, this field builds on a beautiful theory that bridges two very different fields of mathematics: probability theory --- which, along with statistics, forms the core of modern machine learning and data analysis --- as well as discrete math --- particularly graph theory and combinatorics. The field also has intriguing connections to philosophy, especially the question of causality.

Probabilistic modeling is also deeply grounded in reality and has countless real-world applications in fields as diverse as medicine, language processing, vision, physics, and many others.
It is very likely that at least half a dozen applications currently running on your computer are using graphical models internally.
