Add previous/next links

liaoruowang committed Apr 28, 2017
1 parent 5036638 commit cdaa85d
Showing 17 changed files with 75 additions and 3 deletions.
5 changes: 5 additions & 0 deletions extras/vae/index.md
@@ -203,3 +203,8 @@ We may interpret the variational autoencoder as a directed latent-variable proba
The VAE can be applied to images $$x$$ in order to learn interesting latent representations. The VAE paper contains a few examples on the Frey face dataset and on the MNIST digits. On the face dataset, we can interpolate between facial expressions by interpolating between latent variables (e.g. we can generate smooth transitions between "angry" and "surprised"). On the MNIST dataset, we can similarly interpolate between digits.

The authors also compare their method against three alternative approaches: the wake-sleep algorithm, Monte-Carlo EM, and hybrid Monte-Carlo. The latter two methods are sampling-based approaches; they are quite accurate, but don't scale well to large datasets. Wake-sleep is a variational inference algorithm that scales much better; however, it does not use the exact gradient of the ELBO (it uses an approximation), and hence it is not as accurate as AEVB. The paper illustrates this by plotting learning curves.
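The interpolation idea above can be sketched in a few lines. This is not code from the paper; `decode` is a stand-in for a trained VAE decoder $$p(x \mid z)$$, and here we use the identity function so the example is self-contained.

```python
import numpy as np

def interpolate_latents(z_a, z_b, decode, steps=8):
    """Linearly interpolate between two latent codes and decode each point.

    `decode` is a placeholder for the VAE's decoder network; any function
    mapping a latent vector to an output will do for illustration.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return [decode((1 - a) * z_a + a * z_b) for a in alphas]

# Toy "decoder": the identity, so the outputs are the interpolated codes.
frames = interpolate_latents(np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                             decode=lambda z: z)
```

With a real decoder, `frames` would be a sequence of images morphing from one face (or digit) into another.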


<br/>

|[Index](../../) | [Previous](../../learning/structLearn) | [Next]()|
4 changes: 2 additions & 2 deletions index.md
@@ -8,11 +8,11 @@ They are based on Stanford [CS228](http://cs.stanford.edu/~ermon/cs228/index.htm
Although we have written up most of the material, you will probably find several typos. If you do, please let us know, or submit a pull request with your fixes to our [Github repository](https://github.com/ermongroup/cs228-notes).'%}
You too may help make these notes better by submitting your improvements to us via [Github](https://github.com/ermongroup/cs228-notes).

This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the [variational auto-encoder](), an important probabilistic model that is also one of the most influential recent results in deep learning.
This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the [variational auto-encoder](extras/vae), an important probabilistic model that is also one of the most influential recent results in deep learning.

## Preliminaries

1. [Introduction](preliminaries/introduction/) What is probabilistic graphical modeling? Overview of the course.
1. [Introduction](preliminaries/introduction/): What is probabilistic graphical modeling? Overview of the course.

2. [Review of probability theory](preliminaries/probabilityreview): Probability distributions. Conditional probability. Random variables (*under construction*).

4 changes: 4 additions & 0 deletions inference/jt/index.md
@@ -212,3 +212,7 @@ In general, however, it may not converge and its analysis is still an area of a

We will return to this algorithm later in the course and try to explain it as a special case of *variational inference* algorithms.


<br/>

|[Index](../../) | [Previous](../ve) | [Next](../map)|
4 changes: 4 additions & 0 deletions inference/map/index.md
@@ -233,3 +233,7 @@ A third approach is to use sampling methods (e.g. Metropolis-Hastings) to sample

The idea of simulated annealing is to run a sampling algorithm starting with a high $$t$$ and gradually decreasing it as the algorithm runs. If the "cooling rate" is sufficiently slow, we are guaranteed to eventually find the mode of our distribution. In practice, however, choosing the rate requires a lot of tuning, which makes simulated annealing somewhat difficult to use.
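A minimal sketch of the procedure, on a toy one-dimensional energy with two modes (the energy function, proposal, and geometric cooling schedule are illustrative choices, not from the notes):

```python
import math
import random

def simulated_annealing(energy, proposal, x0, t0=10.0, cooling=0.999, steps=20000):
    """Metropolis moves with a temperature that decays each step.

    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-(E' - E) / t). A cooling rate close to 1 trades
    runtime for a better chance of reaching the mode.
    """
    x, t = x0, t0
    best, best_e = x0, energy(x0)
    for _ in range(steps):
        x_new = proposal(x)
        d_e = energy(x_new) - energy(x)
        if d_e < 0 or random.random() < math.exp(-d_e / t):
            x = x_new
        if energy(x) < best_e:
            best, best_e = x, energy(x)
        t *= cooling  # geometric cooling schedule
    return best

random.seed(0)
# Energy (x^2 - 1)^2 has two modes at x = +/-1; annealing from x0 = 5
# should land near one of them.
mode = simulated_annealing(lambda x: (x * x - 1) ** 2,
                           lambda x: x + random.gauss(0, 0.5), x0=5.0)
```

The `cooling` parameter is exactly the quantity the paragraph above says requires tuning: too fast and the chain freezes in a local mode, too slow and the run takes impractically long.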


<br/>

|[Index](../../) | [Previous](../jt) | [Next](../sampling)|
5 changes: 5 additions & 0 deletions inference/sampling/index.md
@@ -224,3 +224,8 @@ This problem will also occur with complicated distributions that have two distin
Another, perhaps more important, problem is that we may not know when to end the burn-in period, even if it is theoretically not too long. Many heuristics exist for determining whether a Markov chain has *mixed*; however, they typically involve plotting certain quantities and judging them by eye, and even the quantitative measures are not significantly more reliable than this approach.

In summary, even though MCMC is able to sample from the right distribution (which in turn can be used to solve any inference problem), doing so may sometimes require a very long time, and there is no easy way to judge the amount of computation that we need to spend to find a good solution.
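As a concrete sketch of the burn-in idea, here is a random-walk Metropolis sampler that simply discards an initial block of samples; the target, step size, and burn-in length are illustrative choices, not prescriptions from the notes:

```python
import math
import random

def metropolis_hastings(log_p, x0, n_samples, burn_in=1000, step=1.0):
    """Random-walk Metropolis with symmetric Gaussian proposals,
    so the acceptance ratio reduces to p(x') / p(x).

    Samples drawn during the first `burn_in` iterations are discarded,
    on the hope (not a guarantee) that the chain has mixed by then.
    """
    x = x0
    samples = []
    for i in range(burn_in + n_samples):
        x_new = x + random.gauss(0, step)
        # Accept with probability min(1, p(x') / p(x)), in log space.
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        if i >= burn_in:
            samples.append(x)
    return samples

random.seed(0)
# Target: a standard normal (log-density up to an additive constant),
# started far from the mode at x0 = 10.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=10.0, n_samples=5000)
mean = sum(samples) / len(samples)
```

Note that the choice `burn_in=1000` is exactly the judgment call the paragraph above warns about: nothing in the algorithm itself tells us it is long enough.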


<br/>

|[Index](../../) | [Previous](../map) | [Next](../variational)|
5 changes: 5 additions & 0 deletions inference/variational/index.md
@@ -181,3 +181,8 @@ However, this by itself is not very useful. Although $$\mathbb{M}$$ is convex, it is
We will make this problem feasible by replacing
-->


<br/>

|[Index](../../) | [Previous](../sampling) | [Next](../../learning/directed)|
5 changes: 5 additions & 0 deletions inference/ve/index.md
@@ -145,3 +145,8 @@ Unfortunately, choosing the optimal VE ordering is an NP-hard problem. However,
- *Min-fill*: Choose vertices to minimize the size of the factor that will be added to the graph.

In practice, these methods often result in reasonably good performance in many interesting settings.
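A greedy min-fill ordering can be sketched directly from its definition (a rough illustration, with the graph as a plain adjacency dictionary; tie-breaking here is arbitrary):

```python
def min_fill_order(adj):
    """Greedy min-fill elimination ordering for an undirected graph.

    adj: dict mapping each vertex to a set of its neighbors.
    At each step, pick the vertex whose elimination adds the fewest
    fill-in edges, connect its neighbors into a clique, and remove it.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # defensive copy
    order = []
    while adj:
        def fill_cost(v):
            # Number of neighbor pairs that are not already connected.
            ns = list(adj[v])
            return sum(1 for i in range(len(ns)) for j in range(i + 1, len(ns))
                       if ns[j] not in adj[ns[i]])
        v = min(adj, key=fill_cost)
        for a in adj[v]:          # add the fill-in edges
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for n in adj[v]:          # eliminate v
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order

# On the chain a - b - c - d, every vertex can be eliminated with zero fill-in.
order = min_fill_order({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}})
```

Min-neighbors would look the same with `fill_cost` replaced by `len(adj[v])`.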


<br/>

|[Index](../../) | [Previous](../../representation/undirected) | [Next](../jt)|
3 changes: 3 additions & 0 deletions learning/bayesianlearning/index.md
@@ -68,3 +68,6 @@ which is another Beta distribution with parameters $$(N_{H}+ \alpha_{H},N_{T}+ \
{% maincolumn 'assets/img/beta.png' 'Here the exponents $$(3,2)$$ and $$(30,20)$$ can both be used to encode the belief that $$\theta$$ is $$0.6$$. But the second set of exponents imply a stronger belief as they are based on a larger sample.' %}
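The conjugate update is simple enough to verify numerically. A small sketch (the specific counts below are illustrative): both pseudo-count settings have prior mean $$0.6$$, but after observing ten heads in a row, the weaker prior moves much further.

```python
def beta_posterior(alpha_h, alpha_t, n_heads, n_tails):
    """Conjugate update: a Beta(a_H, a_T) prior on theta combined with
    coin-flip data yields a Beta(N_H + a_H, N_T + a_T) posterior."""
    return n_heads + alpha_h, n_tails + alpha_t

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Exponents (3, 2) and (30, 20) both encode a prior mean of 0.6.
# Observe 10 heads, 0 tails:
weak = beta_posterior(3, 2, 10, 0)     # -> Beta(13, 2)
strong = beta_posterior(30, 20, 10, 0)  # -> Beta(40, 20)
```

The posterior mean under the weak prior is $$13/15 \approx 0.87$$, while the strong prior holds it at $$40/60 \approx 0.67$$, matching the figure's point about sample size.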


<br/>

|[Index](../../) | [Previous](../latent) | [Next](../structLearn)|
5 changes: 5 additions & 0 deletions learning/directed/index.md
@@ -151,3 +151,8 @@ This is essentially the same as the head/tails example we saw earlier (except wi
{% endmath %}

We thus conclude that in Bayesian networks with discrete variables, the maximum-likelihood estimate has a closed-form solution. Even when the variables are not discrete, the task is equally simple: the log-factors are linearly separable, hence the log-likelihood reduces to estimating each of them separately. The simplicity of learning is one of the most convenient features of Bayesian networks.
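The closed-form solution is just a ratio of counts, $$\theta(x_i \mid pa_i) = \#(x_i, pa_i) / \#(pa_i)$$, which can be sketched in a few lines (the data representation below is an illustrative choice):

```python
from collections import Counter

def mle_cpds(data, parents):
    """Closed-form MLE for a discrete Bayesian network:
    theta(x_i | pa_i) = count(x_i, pa_i) / count(pa_i).

    data: list of dicts mapping variable name -> value.
    parents: dict mapping each variable to a tuple of its parent names.
    """
    cpds = {}
    for var, pa in parents.items():
        joint = Counter((tuple(d[p] for p in pa), d[var]) for d in data)
        pa_counts = Counter(tuple(d[p] for p in pa) for d in data)
        cpds[var] = {(pa_val, val): c / pa_counts[pa_val]
                     for (pa_val, val), c in joint.items()}
    return cpds

# Tiny network x -> y with four observations.
data = [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}, {'x': 1, 'y': 1}, {'x': 1, 'y': 1}]
cpds = mle_cpds(data, {'x': (), 'y': ('x',)})
```

Each conditional distribution is estimated independently, which is the linear separability of the log-likelihood in action.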


<br/>

|[Index](../../) | [Previous](../../inference/variational) | [Next](../undirected)|
5 changes: 5 additions & 0 deletions learning/latent/index.md
@@ -193,3 +193,8 @@ From our above discussion, it follows that EM has the following properties:
However, since we are optimizing a non-convex objective, we have no guarantee of finding the global optimum. In fact, EM in practice almost always converges to a local optimum, and moreover, that optimum heavily depends on the choice of initialization. Different initial $$\theta_0$$ can lead to very different solutions, so it is very common to use multiple restarts of the algorithm and choose the best one in the end. In fact, EM is so sensitive to the choice of initial parameters that techniques for choosing these parameters remain an active area of research.

In summary, the EM algorithm is a very popular and often very effective technique for optimizing latent variable models. Its main downside is its difficulty with local optima.
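The restart strategy can be sketched on a toy model: EM for a two-component 1D Gaussian mixture (with variances fixed to 1 to keep the example short — a simplification, not the general algorithm), run from several random initializations, keeping the run with the highest log-likelihood.

```python
import math
import random

def em_gmm_1d(xs, n_iter=50, seed=0):
    """EM for a two-component 1D Gaussian mixture with fixed unit variances.
    Only the means and the mixing weight are learned; the random
    initialization is exactly why restarts are used in practice."""
    rng = random.Random(seed)
    mu = [rng.uniform(min(xs), max(xs)) for _ in range(2)]
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each point.
        r = []
        for x in xs:
            p0 = pi * math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = (1 - pi) * math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate the means and the mixing weight.
        n0 = sum(r)
        mu[0] = sum(ri * x for ri, x in zip(r, xs)) / n0
        mu[1] = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n0)
        pi = n0 / len(xs)
    # Final log-likelihood, used to compare restarts.
    ll = sum(math.log(pi * math.exp(-0.5 * (x - mu[0]) ** 2)
                      + (1 - pi) * math.exp(-0.5 * (x - mu[1]) ** 2))
             for x in xs)
    return ll, mu, pi

random.seed(1)
xs = ([random.gauss(-3, 1) for _ in range(200)]
      + [random.gauss(3, 1) for _ in range(200)])
# Multiple restarts: keep the run whose final log-likelihood is highest.
ll, mu, pi = max(em_gmm_1d(xs, seed=s) for s in range(5))
```

A run whose initialization puts both means on the same side of the data can stall in a poor local optimum; the `max` over restarts is what guards against it.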


<br/>

|[Index](../../) | [Previous](../undirected) | [Next](../bayesianlearning)|
5 changes: 5 additions & 0 deletions learning/structLearn/index.md
@@ -62,3 +62,8 @@ In this section, we will briefly introduce two recent algorithms for graph searc
The OS approach, as the name suggests, searches over topological orders and over the graph space at the same time. The K3 algorithm assumes a topological order in advance and searches only over graphs that obey this order. When the specified order is a poor one, the search may end with a bad graph structure (one with a low graph score). The OS algorithm resolves this problem by searching over orders at the same time: it swaps two adjacent variables in an order at each step and employs the K3 algorithm as a sub-routine.

The ILP approach encodes the graph structure, the scoring function, and the acyclicity constraints into an integer linear program, which allows it to use state-of-the-art integer programming solvers. However, this approach requires a bound on the maximum number of parents per node (say, 4 or 5); otherwise, the number of constraints in the ILP explodes and the computation becomes intractable.


<br/>

|[Index](../../) | [Previous](../bayesianlearning) | [Next](../../extras/vae)|
5 changes: 5 additions & 0 deletions learning/undirected/index.md
@@ -187,3 +187,8 @@ This makes learning CRFs more expensive than learning in MRFs. In practice, howe
To deal with the computational difficulties introduced by the partition function, we may use simpler models in which exact inference is tractable. This was the approach taken in the OCR example introduced in our first discussion of CRFs. More generally, one should try to limit the number of variables or make sure that the model's graph is not too densely connected.

Finally, we would like to add that there exists another popular objective for training CRFs called the max-margin loss, a generalization of the objective for training SVMs. Models trained using this loss are called *structured support vector machines* or *max-margin networks*. This loss is more widely used in practice because it often leads to better generalization and because it requires only MAP inference to compute the gradient, rather than general (e.g. marginal) inference, which is often more expensive to perform.


<br/>

|[Index](../../) | [Previous](../directed) | [Next](../latent)|
7 changes: 6 additions & 1 deletion preliminaries/applications/index.md
@@ -39,4 +39,9 @@ title: Real World Applications
[Stanford scientists combine satellite data, machine learning to map poverty](http://news.stanford.edu/2016/08/18/combining-satellite-data-machine-learning-to-map-poverty/)

# Error Correcting Codes
![codes](Picture1.png)
![codes](Picture1.png)


<br/>

|[Index](../../) | [Previous](../probabilityreview/) | [Next](../../representation/directed/)|
5 changes: 5 additions & 0 deletions preliminaries/introduction/index.md
@@ -101,3 +101,8 @@ It turns out that inference is a very challenging task. For many probabilities o
### Learning

Our last key task refers to fitting a model to a dataset, which could be for example a large number of labeled examples of spam. By looking at the data, we can infer useful patterns (e.g. which words are found more frequently in spam emails), which we can then use to make predictions about the future. However, we will see that learning and inference are also inherently linked in a more subtle way, since inference will turn out to be a key subroutine that we will repeatedly call within learning algorithms. Also, the topic of learning will feature important connections to the field of computational learning theory --- which deals with questions such as generalization from limited data and overfitting --- as well as to Bayesian statistics --- which tells us (among other things) about how to combine prior knowledge and observed evidence in a principled way.


<br/>

|[Index](../../) | [Previous](../../) | [Next](../probabilityreview)|
4 changes: 4 additions & 0 deletions preliminaries/probabilityreview/index.md
@@ -399,3 +399,7 @@ Here, the key step in showing the equality of the two forms of covariance is in
- If $$X$$ and $$Y$$ are independent, then $$Cov[X, Y] = 0$$.
- If $$X$$ and $$Y$$ are independent, then $$E[f(X)g(Y)] = E[f(X)]E[g(Y)]$$.
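Both facts are easy to check empirically; a quick sketch with simulated draws (sample sizes and tolerances are illustrative):

```python
import random

def covariance(xs, ys):
    """Sample covariance via E[XY] - E[X]E[Y] (1/n normalization)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mx * my

random.seed(0)
# X and Y drawn independently: the sample covariance should be near zero.
xs = [random.gauss(0, 1) for _ in range(100000)]
ys = [random.gauss(0, 1) for _ in range(100000)]
cov_indep = covariance(xs, ys)

# Y = X is maximally dependent: the covariance equals Var[X], about 1 here.
cov_dep = covariance(xs, xs)
```

Recall that the converse of the first bullet fails: zero covariance does not imply independence (e.g. $$Y = X^2$$ for symmetric $$X$$).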


<br/>

|[Index](../../) | [Previous](../introduction/) | [Next](../applications/)|
4 changes: 4 additions & 0 deletions representation/directed/index.md
@@ -127,3 +127,7 @@ The cascade-type structures (a,b) are clearly symmetric and the directionality o
If $$G,G'$$ have the same skeleton and the same v-structures, then $$I(G) = I(G').$$

Again, it is easy to understand intuitively why this is true. Two graphs are I-equivalent if the $$d$$-separation between variables is the same. We can flip the directionality of any edge, unless it forms a v-structure, and the $$d$$-connectivity of the graph will be unchanged. We refer the reader to the textbook of Koller and Friedman for a full proof.
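The skeleton-plus-v-structures condition is simple enough to sketch in code (a rough illustration over DAGs represented as child-to-parents dictionaries, not a full I-equivalence test):

```python
def skeleton(parents):
    """Undirected edge set of a DAG given as child -> set of parents."""
    return {frozenset({c, p}) for c, ps in parents.items() for p in ps}

def v_structures(parents):
    """Triples a -> c <- b where a and b are not adjacent (immoralities)."""
    skel = skeleton(parents)
    vs = set()
    for c, ps in parents.items():
        ps = sorted(ps)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset({ps[i], ps[j]}) not in skel:
                    vs.add((ps[i], c, ps[j]))
    return vs

def i_equivalent(g1, g2):
    """Sufficient condition from the text: same skeleton, same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Chain a -> b -> c and fork a <- b -> c: same skeleton, no v-structures.
chain = {'a': set(), 'b': {'a'}, 'c': {'b'}}
fork = {'a': {'b'}, 'b': set(), 'c': {'b'}}
# Collider a -> b <- c: same skeleton, but a v-structure at b.
collider = {'a': set(), 'b': {'a', 'c'}, 'c': set()}
```

The chain and fork encode the same independencies, while the collider does not, exactly as the discussion of cascade structures above predicts.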

<br/>

|[Index](../../) | [Previous](../../preliminaries/applications) | [Next](../undirected)|
3 changes: 3 additions & 0 deletions representation/undirected/index.md
@@ -147,3 +147,6 @@ where $$\phi'_i(y_i) = \phi_i(x,y_i)$$. Using global features only changes the v

This observation may be interpreted in a slightly more general form. If we were to model $$p(x,y)$$ using an MRF (viewed as a single model over $$x, y$$ with normalizing constant $$Z = \sum_{x,y} \tilde{p}(x,y)$$), then we would need to fit two distributions to the data: $$p(y\mid x)$$ and $$p(x)$$. However, if all we are interested in is predicting $$y$$ given $$x$$, then modeling $$p(x)$$ is unnecessary. In fact, it may be statistically disadvantageous to do so (e.g. we may not have enough data to fit both $$p(y\mid x)$$ and $$p(x)$$; since the models have shared parameters, fitting one may not yield the best parameters for the other), and it may be a bad idea computationally (we would need to make simplifying assumptions so that $$p(x)$$ can be handled tractably). CRFs forgo this assumption, and often perform better on prediction tasks.

<br/>

|[Index](../../) | [Previous](../directed) | [Next](../../inference/ve)|
