Add previous/next links

liaoruowang committed Apr 28, 2017
1 parent 5036638 commit cdaa85d
Showing 17 changed files with 75 additions and 3 deletions.
5 changes: 5 additions & 0 deletions extras/vae/index.md
@@ -203,3 +203,8 @@ We may interpret the variational autoencoder as a directed latent-variable proba
The VAE can be applied to images $$x$$ in order to learn interesting latent representations. The VAE paper contains a few examples on the Frey face dataset and on the MNIST digits. On the face dataset, we can interpolate between facial expressions by interpolating between latent variables (e.g. we can generate smooth transitions between "angry" and "surprised"). On the MNIST dataset, we can similarly interpolate between digits.

The authors also compare their method against three alternative approaches: the wake-sleep algorithm, Monte-Carlo EM, and hybrid Monte-Carlo. The latter two methods are sampling-based approaches; they are quite accurate, but don't scale well to large datasets. Wake-sleep is a variational inference algorithm that scales much better; however, it does not use the exact gradient of the ELBO (it uses an approximation), and hence it is not as accurate as AEVB. The paper illustrates this by plotting learning curves.
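The interpolation idea above can be sketched in a few lines. This is not code from the paper; `decode` is a stand-in for a trained VAE decoder $$p(x \mid z)$$, and here we use the identity function so the example is self-contained.

```python
import numpy as np

def interpolate_latents(z_a, z_b, decode, steps=8):
    """Linearly interpolate between two latent codes and decode each point.

    `decode` is a placeholder for the VAE's decoder network; any function
    mapping a latent vector to an output will do for illustration.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return [decode((1 - a) * z_a + a * z_b) for a in alphas]

# Toy "decoder": the identity, so the outputs are the interpolated codes.
frames = interpolate_latents(np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                             decode=lambda z: z)
```

With a real decoder, `frames` would be a sequence of images morphing from one face (or digit) into another.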


<br/>

|[Index](../../) | [Previous](../../learning/structLearn) | [Next]()|
4 changes: 2 additions & 2 deletions index.md
@@ -8,11 +8,11 @@ They are based on Stanford [CS228](http://cs.stanford.edu/~ermon/cs228/index.htm
Although we have written up most of the material, you will probably find several typos. If you do, please let us know, or submit a pull request with your fixes to our [Github repository](https://github.com/ermongroup/cs228-notes).'%}
You too may help make these notes better by submitting your improvements to us via [Github](https://github.com/ermongroup/cs228-notes).

This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the [variational auto-encoder](), an important probabilistic model that is also one of the most influential recent results in deep learning.
This course starts by introducing probabilistic graphical models from the very basics and concludes by explaining from first principles the [variational auto-encoder](extras/vae), an important probabilistic model that is also one of the most influential recent results in deep learning.

## Preliminaries

1. [Introduction](preliminaries/introduction/) What is probabilistic graphical modeling? Overview of the course.
1. [Introduction](preliminaries/introduction/): What is probabilistic graphical modeling? Overview of the course.

2. [Review of probability theory](preliminaries/probabilityreview): Probability distributions. Conditional probability. Random variables (*under construction*).

4 changes: 4 additions & 0 deletions inference/jt/index.md
@@ -212,3 +212,7 @@ In general, however, it may not converge and its analysis is still an area of a

We will return to this algorithm later in the course and try to explain it as a special case of *variational inference* algorithms.


<br/>

|[Index](../../) | [Previous](../ve) | [Next](../map)|
4 changes: 4 additions & 0 deletions inference/map/index.md
@@ -233,3 +233,7 @@ A third approach is to use sampling methods (e.g. Metropolis-Hastings) to sample

The idea of simulated annealing is to run a sampling algorithm starting with a high $$t$$ and gradually decreasing it as the algorithm runs. If the "cooling rate" is sufficiently slow, we are guaranteed to eventually find the mode of our distribution. In practice, however, choosing the rate requires a lot of tuning, which makes simulated annealing somewhat difficult to use.
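A minimal sketch of the procedure, on a toy one-dimensional energy with two modes (the energy function, proposal, and geometric cooling schedule are illustrative choices, not from the notes):

```python
import math
import random

def simulated_annealing(energy, proposal, x0, t0=10.0, cooling=0.999, steps=20000):
    """Metropolis moves with a temperature that decays each step.

    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-(E' - E) / t). A cooling rate close to 1 trades
    runtime for a better chance of reaching the mode.
    """
    x, t = x0, t0
    best, best_e = x0, energy(x0)
    for _ in range(steps):
        x_new = proposal(x)
        d_e = energy(x_new) - energy(x)
        if d_e < 0 or random.random() < math.exp(-d_e / t):
            x = x_new
        if energy(x) < best_e:
            best, best_e = x, energy(x)
        t *= cooling  # geometric cooling schedule
    return best

random.seed(0)
# Energy (x^2 - 1)^2 has two modes at x = +/-1; annealing from x0 = 5
# should land near one of them.
mode = simulated_annealing(lambda x: (x * x - 1) ** 2,
                           lambda x: x + random.gauss(0, 0.5), x0=5.0)
```

The `cooling` parameter is exactly the quantity the paragraph above says requires tuning: too fast and the chain freezes in a local mode, too slow and the run takes impractically long.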


<br/>

|[Index](../../) | [Previous](../jt) | [Next](../sampling)|
5 changes: 5 additions & 0 deletions inference/sampling/index.md
@@ -224,3 +224,8 @@ This problem will also occur with complicated distributions that have two distin
Another, perhaps more important, problem is that we may not know when to end the burn-in period, even if it is theoretically not too long. Many heuristics exist for determining whether a Markov chain has *mixed*; however, they typically involve plotting certain quantities and judging them by eye, and even the quantitative measures are not significantly more reliable than this approach.

In summary, even though MCMC is able to sample from the right distribution (which in turn can be used to solve any inference problem), doing so may sometimes require a very long time, and there is no easy way to judge the amount of computation that we need to spend to find a good solution.
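As a concrete sketch of the burn-in idea, here is a random-walk Metropolis sampler that simply discards an initial block of samples; the target, step size, and burn-in length are illustrative choices, not prescriptions from the notes:

```python
import math
import random

def metropolis_hastings(log_p, x0, n_samples, burn_in=1000, step=1.0):
    """Random-walk Metropolis with symmetric Gaussian proposals,
    so the acceptance ratio reduces to p(x') / p(x).

    Samples drawn during the first `burn_in` iterations are discarded,
    on the hope (not a guarantee) that the chain has mixed by then.
    """
    x = x0
    samples = []
    for i in range(burn_in + n_samples):
        x_new = x + random.gauss(0, step)
        # Accept with probability min(1, p(x') / p(x)), in log space.
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        if i >= burn_in:
            samples.append(x)
    return samples

random.seed(0)
# Target: a standard normal (log-density up to an additive constant),
# started far from the mode at x0 = 10.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=10.0, n_samples=5000)
mean = sum(samples) / len(samples)
```

Note that the choice `burn_in=1000` is exactly the judgment call the paragraph above warns about: nothing in the algorithm itself tells us it is long enough.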


<br/>

|[Index](../../) | [Previous](../map) | [Next](../variational)|
5 changes: 5 additions & 0 deletions inference/variational/index.md
@@ -181,3 +181,8 @@ However, this by itself is not very useful. Although $$\mathbb{M}$$ is convex, it is
We will make this problem feasible by replacing
-->


<br/>

|[Index](../../) | [Previous](../sampling) | [Next](../../learning/directed)|
5 changes: 5 additions & 0 deletions inference/ve/index.md
@@ -145,3 +145,8 @@ Unfortunately, choosing the optimal VE ordering is an NP-hard problem. However,
- *Min-fill*: Choose vertices to minimize the size of the factor that will be added to the graph.

In practice, these methods often result in reasonably good performance in many interesting settings.
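A greedy min-fill ordering can be sketched directly from its definition (a rough illustration, with the graph as a plain adjacency dictionary; tie-breaking here is arbitrary):

```python
def min_fill_order(adj):
    """Greedy min-fill elimination ordering for an undirected graph.

    adj: dict mapping each vertex to a set of its neighbors.
    At each step, pick the vertex whose elimination adds the fewest
    fill-in edges, connect its neighbors into a clique, and remove it.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # defensive copy
    order = []
    while adj:
        def fill_cost(v):
            # Number of neighbor pairs that are not already connected.
            ns = list(adj[v])
            return sum(1 for i in range(len(ns)) for j in range(i + 1, len(ns))
                       if ns[j] not in adj[ns[i]])
        v = min(adj, key=fill_cost)
        for a in adj[v]:          # add the fill-in edges
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for n in adj[v]:          # eliminate v
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order

# On the chain a - b - c - d, every vertex can be eliminated with zero fill-in.
order = min_fill_order({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}})
```

Min-neighbors would look the same with `fill_cost` replaced by `len(adj[v])`.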


<br/>

|[Index](../../) | [Previous](../../representation/undirected) | [Next](../jt)|
3 changes: 3 additions & 0 deletions learning/bayesianlearning/index.md
@@ -68,3 +68,6 @@ which is another Beta distribution with parameters $$(N_{H}+ \alpha_{H},N_{T}+ \
{% maincolumn 'assets/img/beta.png' 'Here the exponents $$(3,2)$$ and $$(30,20)$$ can both be used to encode the belief that $$\theta$$ is $$0.6$$. But the second set of exponents imply a stronger belief as they are based on a larger sample.' %}
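The conjugate update is simple enough to verify numerically. A small sketch (the specific counts below are illustrative): both pseudo-count settings have prior mean $$0.6$$, but after observing ten heads in a row, the weaker prior moves much further.

```python
def beta_posterior(alpha_h, alpha_t, n_heads, n_tails):
    """Conjugate update: a Beta(a_H, a_T) prior on theta combined with
    coin-flip data yields a Beta(N_H + a_H, N_T + a_T) posterior."""
    return n_heads + alpha_h, n_tails + alpha_t

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Exponents (3, 2) and (30, 20) both encode a prior mean of 0.6.
# Observe 10 heads, 0 tails:
weak = beta_posterior(3, 2, 10, 0)     # -> Beta(13, 2)
strong = beta_posterior(30, 20, 10, 0)  # -> Beta(40, 20)
```

The posterior mean under the weak prior is $$13/15 \approx 0.87$$, while the strong prior holds it at $$40/60 \approx 0.67$$, matching the figure's point about sample size.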


<br/>

|[Index](../../) | [Previous](../latent) | [Next](../structLearn)|
5 changes: 5 additions & 0 deletions learning/directed/index.md
@@ -151,3 +151,8 @@ This is essentially the same as the head/tails example we saw earlier (except wi
{% endmath %}

We thus conclude that in Bayesian networks with discrete variables, the maximum-likelihood estimate has a closed-form solution. Even when the variables are not discrete, the task is equally simple: the log-factors are linearly separable, hence the log-likelihood reduces to estimating each of them separately. The simplicity of learning is one of the most convenient features of Bayesian networks.
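The closed-form solution is just a ratio of counts, $$\theta(x_i \mid pa_i) = \#(x_i, pa_i) / \#(pa_i)$$, which can be sketched in a few lines (the data representation below is an illustrative choice):

```python
from collections import Counter

def mle_cpds(data, parents):
    """Closed-form MLE for a discrete Bayesian network:
    theta(x_i | pa_i) = count(x_i, pa_i) / count(pa_i).

    data: list of dicts mapping variable name -> value.
    parents: dict mapping each variable to a tuple of its parent names.
    """
    cpds = {}
    for var, pa in parents.items():
        joint = Counter((tuple(d[p] for p in pa), d[var]) for d in data)
        pa_counts = Counter(tuple(d[p] for p in pa) for d in data)
        cpds[var] = {(pa_val, val): c / pa_counts[pa_val]
                     for (pa_val, val), c in joint.items()}
    return cpds

# Tiny network x -> y with four observations.
data = [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}, {'x': 1, 'y': 1}, {'x': 1, 'y': 1}]
cpds = mle_cpds(data, {'x': (), 'y': ('x',)})
```

Each conditional distribution is estimated independently, which is the linear separability of the log-likelihood in action.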


<br/>

|[Index](../../) | [Previous](../../inference/variational) | [Next](../undirected)|
5 changes: 5 additions & 0 deletions learning/latent/index.md
@@ -193,3 +193,8 @@ From our above discussion, it follows that EM has the following properties:
However, since we are optimizing a non-convex objective, we have no guarantee of finding the global optimum. In fact, EM in practice almost always converges to a local optimum, and moreover, that optimum heavily depends on the choice of initialization. Different initial $$\theta_0$$ can lead to very different solutions, so it is very common to use multiple restarts of the algorithm and choose the best one in the end. In fact, EM is so sensitive to the choice of initial parameters that techniques for choosing these parameters remain an active area of research.

In summary, the EM algorithm is a very popular and often very effective technique for optimizing latent variable models. Its main downside is its difficulty with local optima.
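The restart strategy can be sketched on a toy model: EM for a two-component 1D Gaussian mixture (with variances fixed to 1 to keep the example short — a simplification, not the general algorithm), run from several random initializations, keeping the run with the highest log-likelihood.

```python
import math
import random

def em_gmm_1d(xs, n_iter=50, seed=0):
    """EM for a two-component 1D Gaussian mixture with fixed unit variances.
    Only the means and the mixing weight are learned; the random
    initialization is exactly why restarts are used in practice."""
    rng = random.Random(seed)
    mu = [rng.uniform(min(xs), max(xs)) for _ in range(2)]
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each point.
        r = []
        for x in xs:
            p0 = pi * math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = (1 - pi) * math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate the means and the mixing weight.
        n0 = sum(r)
        mu[0] = sum(ri * x for ri, x in zip(r, xs)) / n0
        mu[1] = sum((1 - ri) * x for ri, x in zip(r, xs)) / (len(xs) - n0)
        pi = n0 / len(xs)
    # Final log-likelihood, used to compare restarts.
    ll = sum(math.log(pi * math.exp(-0.5 * (x - mu[0]) ** 2)
                      + (1 - pi) * math.exp(-0.5 * (x - mu[1]) ** 2))
             for x in xs)
    return ll, mu, pi

random.seed(1)
xs = ([random.gauss(-3, 1) for _ in range(200)]
      + [random.gauss(3, 1) for _ in range(200)])
# Multiple restarts: keep the run whose final log-likelihood is highest.
ll, mu, pi = max(em_gmm_1d(xs, seed=s) for s in range(5))
```

A run whose initialization puts both means on the same side of the data can stall in a poor local optimum; the `max` over restarts is what guards against it.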


<br/>

|[Index](../../) | [Previous](../undirected) | [Next](../bayesianlearning)|
5 changes: 5 additions & 0 deletions learning/structLearn/index.md
@@ -62,3 +62,8 @@ In this section, we will briefly introduce two recent algorithms for graph searc
The OS approach, as the name suggests, searches over topological orders and over the graph space at the same time. The K3 algorithm assumes a topological order in advance and searches only over graphs that obey this order. When the specified order is a poor one, the search may end with a bad graph structure (one with a low graph score). The OS algorithm resolves this problem by searching over orders at the same time: it swaps two adjacent variables in an order at each step and employs the K3 algorithm as a sub-routine.

The ILP approach encodes the graph structure, the scoring function, and the acyclicity constraints into an integer linear program, which allows it to use state-of-the-art integer programming solvers. However, this approach requires a bound on the maximum number of parents per node (say, 4 or 5); otherwise, the number of constraints in the ILP explodes and the computation becomes intractable.


<br/>

|[Index](../../) | [Previous](../bayesianlearning) | [Next](../../extras/vae)|
5 changes: 5 additions & 0 deletions learning/undirected/index.md
@@ -187,3 +187,8 @@ This makes learning CRFs more expensive than learning in MRFs. In practice, howe
To deal with the computational difficulties introduced by the partition function, we may use simpler models in which exact inference is tractable. This was the approach taken in the OCR example introduced in our first discussion of CRFs. More generally, one should try to limit the number of variables or make sure that the model's graph is not too densely connected.

Finally, we would like to add that there exists another popular objective for training CRFs called the max-margin loss, a generalization of the objective for training SVMs. Models trained using this loss are called *structured support vector machines* or *max-margin networks*. This loss is more widely used in practice because it often leads to better generalization and because it requires only MAP inference to compute the gradient, rather than general (e.g. marginal) inference, which is often more expensive to perform.


<br/>

|[Index](../../) | [Previous](../directed) | [Next](../latent)|
7 changes: 6 additions & 1 deletion preliminaries/applications/index.md
@@ -39,4 +39,9 @@ title: Real World Applications
[Stanford scientists combine satellite data, machine learning to map poverty](http://news.stanford.edu/2016/08/18/combining-satellite-data-machine-learning-to-map-poverty/)

# Error Correcting Codes
![codes](Picture1.png)
![codes](Picture1.png)


<br/>

|[Index](../../) | [Previous](../probabilityreview/) | [Next](../../representation/directed/)|
5 changes: 5 additions & 0 deletions preliminaries/introduction/index.md
@@ -101,3 +101,8 @@ It turns out that inference is a very challenging task. For many probabilities o
### Learning

Our last key task refers to fitting a model to a dataset, which could be for example a large number of labeled examples of spam. By looking at the data, we can infer useful patterns (e.g. which words are found more frequently in spam emails), which we can then use to make predictions about the future. However, we will see that learning and inference are also inherently linked in a more subtle way, since inference will turn out to be a key subroutine that we will repeatedly call within learning algorithms. Also, the topic of learning will feature important connections to the field of computational learning theory --- which deals with questions such as generalization from limited data and overfitting --- as well as to Bayesian statistics --- which tells us (among other things) about how to combine prior knowledge and observed evidence in a principled way.


<br/>

|[Index](../../) | [Previous](../../) | [Next](../probabilityreview)|
4 changes: 4 additions & 0 deletions preliminaries/probabilityreview/index.md
@@ -399,3 +399,7 @@ Here, the key step in showing the equality of the two forms of covariance is in
- If $$X$$ and $$Y$$ are independent, then $$Cov[X, Y] = 0$$.
- If $$X$$ and $$Y$$ are independent, then $$E[f(X)g(Y)] = E[f(X)]E[g(Y)]$$.
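Both facts are easy to check empirically; a quick sketch with simulated draws (sample sizes and tolerances are illustrative):

```python
import random

def covariance(xs, ys):
    """Sample covariance via E[XY] - E[X]E[Y] (1/n normalization)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mx * my

random.seed(0)
# X and Y drawn independently: the sample covariance should be near zero.
xs = [random.gauss(0, 1) for _ in range(100000)]
ys = [random.gauss(0, 1) for _ in range(100000)]
cov_indep = covariance(xs, ys)

# Y = X is maximally dependent: the covariance equals Var[X], about 1 here.
cov_dep = covariance(xs, xs)
```

Recall that the converse of the first bullet fails: zero covariance does not imply independence (e.g. $$Y = X^2$$ for symmetric $$X$$).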


<br/>

|[Index](../../) | [Previous](../introduction/) | [Next](../applications/)|
4 changes: 4 additions & 0 deletions representation/directed/index.md
@@ -127,3 +127,7 @@ The cascade-type structures (a,b) are clearly symmetric and the directionality o
If $$G,G'$$ have the same skeleton and the same v-structures, then $$I(G) = I(G').$$

Again, it is easy to understand intuitively why this is true. Two graphs are I-equivalent if the $$d$$-separation between variables is the same. We can flip the directionality of any edge, unless it forms a v-structure, and the $$d$$-connectivity of the graph will be unchanged. We refer the reader to the textbook of Koller and Friedman for a full proof.
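The skeleton-plus-v-structures condition is simple enough to sketch in code (a rough illustration over DAGs represented as child-to-parents dictionaries, not a full I-equivalence test):

```python
def skeleton(parents):
    """Undirected edge set of a DAG given as child -> set of parents."""
    return {frozenset({c, p}) for c, ps in parents.items() for p in ps}

def v_structures(parents):
    """Triples a -> c <- b where a and b are not adjacent (immoralities)."""
    skel = skeleton(parents)
    vs = set()
    for c, ps in parents.items():
        ps = sorted(ps)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset({ps[i], ps[j]}) not in skel:
                    vs.add((ps[i], c, ps[j]))
    return vs

def i_equivalent(g1, g2):
    """Sufficient condition from the text: same skeleton, same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Chain a -> b -> c and fork a <- b -> c: same skeleton, no v-structures.
chain = {'a': set(), 'b': {'a'}, 'c': {'b'}}
fork = {'a': {'b'}, 'b': set(), 'c': {'b'}}
# Collider a -> b <- c: same skeleton, but a v-structure at b.
collider = {'a': set(), 'b': {'a', 'c'}, 'c': set()}
```

The chain and fork encode the same independencies, while the collider does not, exactly as the discussion of cascade structures above predicts.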

<br/>

|[Index](../../) | [Previous](../../preliminaries/applications) | [Next](../undirected)|
3 changes: 3 additions & 0 deletions representation/undirected/index.md
@@ -147,3 +147,6 @@ where $$\phi'_i(y_i) = \phi_i(x,y_i)$$. Using global features only changes the v

This observation may be interpreted in a slightly more general form. If we were to model $$p(x,y)$$ using an MRF (viewed as a single model over $$x, y$$ with normalizing constant $$Z = \sum_{x,y} \tilde{p}(x,y)$$), then we would need to fit two distributions to the data: $$p(y\mid x)$$ and $$p(x)$$. However, if all we are interested in is predicting $$y$$ given $$x$$, then modeling $$p(x)$$ is unnecessary. In fact, it may be statistically disadvantageous to do so (e.g. we may not have enough data to fit both $$p(y\mid x)$$ and $$p(x)$$; since the models have shared parameters, fitting one may not yield the best parameters for the other), and it may be a bad idea computationally (we would need to make simplifying assumptions so that $$p(x)$$ can be handled tractably). CRFs forgo this assumption, and often perform better on prediction tasks.

<br/>

|[Index](../../) | [Previous](../directed) | [Next](../../inference/ve)|
