class: middle, center, title-slide
Lecture 6: Recurrent neural networks
Prof. Gilles Louppe
[email protected]
How to make sense of sequential data?
- Temporal convolutions
- Recurrent neural networks
- Applications
- Beyond sequences
class: middle
Many real-world problems require processing a signal with a sequence structure.
- Sequence classification:
- sentiment analysis
- activity/action recognition
- DNA sequence classification
- action selection
- Sequence synthesis:
- text synthesis
- music synthesis
- motion synthesis
- Sequence-to-sequence translation:
- speech recognition
- text translation
- part-of-speech tagging
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
Given a set $\mathcal{X}$, let $S(\mathcal{X})$ denote the set of sequences of elements from $\mathcal{X}$.
.grid.center[ .kol-1-2.bold[Sequence classification] .kol-1-2[$f: S(\mathcal{X}) \to \\{ 1, ..., C\\}$] ] .grid.center[ .kol-1-2.bold[Sequence synthesis] .kol-1-2[$f: \mathbb{R}^d \to S(\mathcal{X})$] ] .grid.center[ .kol-1-2.bold[Sequence-to-sequence translation] .kol-1-2[$f: S(\mathcal{X}) \to S(\mathcal{Y})$] ]
In the rest of the slides, we consider only time-indexed signals, although the same ideas generalize to arbitrary sequences.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
class: middle
The simplest approach to sequence processing is to use temporal convolutional networks (TCNs).
TCNs correspond to standard 1D convolutional networks. They process input sequences as fixed-size inputs, padded up to the maximum possible length.
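As a rough sketch (in PyTorch, with arbitrary channel widths and kernel sizes, and a hypothetical maximum length `T_max`), such a 1D convolutional sequence classifier could look like:

```python
import torch
import torch.nn as nn

# Assumed sizes: d input channels, sequences padded to T_max steps, C classes.
d, T_max, C = 16, 128, 10

tcn = nn.Sequential(
    nn.Conv1d(d, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),    # pool over the time axis
    nn.Flatten(),
    nn.Linear(32, C),
)

x = torch.randn(8, d, T_max)    # a batch of 8 padded sequences
print(tcn(x).shape)             # torch.Size([8, 10])
```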
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
Increasing the kernel sizes exponentially makes the required number of layers grow as $O(\log T)$, where $T$ is the length of the time window taken into account, but the number of parameters then grows as $O(T)$.

Dilated convolutions keep the kernels small while growing the receptive field exponentially with depth, so that the model size grows as $O(\log T)$ only.
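A short sketch of this argument (arbitrary channel widths): doubling the dilation at each layer of kernel-size-2 convolutions doubles the receptive field, so roughly $\log_2 T$ layers suffice to cover a window of length $T$.

```python
import torch.nn as nn

# Receptive field of a stack of kernel-size-2 convolutions whose dilation
# doubles at each layer: 1 + 1 + 2 + 4 + ... = 2^L after L layers.
T, receptive_field, dilation, n_layers = 1024, 1, 1, 0
while receptive_field < T:
    receptive_field += dilation     # each layer adds (kernel_size - 1) * dilation
    dilation *= 2
    n_layers += 1
print(n_layers)                     # 10 = log2(1024) layers to cover T = 1024

# One such dilated layer (channels and dilation chosen arbitrarily):
conv = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=2, dilation=8)
```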
.footnote[Credits: Philippe Remy, keras-tcn, 2018; Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Bai et al, An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, 2018.]
class: middle
class: middle
When the input is a sequence $\mathbf{x} \in S(\mathbb{R}^d)$ of variable length $T(\mathbf{x})$, the historical approach is to use a recurrent model which maintains a recurrent state $\mathbf{h}_t \in \mathbb{R}^q$ updated at each time step $t$.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
Formally, for $t = 1, \ldots, T(\mathbf{x})$,
$$\mathbf{h}_t = \phi(\mathbf{x}_t, \mathbf{h}_{t-1}; \theta),$$
where $\phi: \mathbb{R}^d \times \mathbb{R}^q \to \mathbb{R}^q$ and $\mathbf{h}_0 \in \mathbb{R}^q$ is an initial recurrent state.

Predictions can be computed at any time step $t$ from the recurrent state,
$$y_t = \psi(\mathbf{h}_t; \theta).$$
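A minimal sketch of this recurrence (PyTorch, with assumed sizes $d$, $q$ and $C$, and `nn.RNNCell` standing in for one possible choice of $\phi$):

```python
import torch
import torch.nn as nn

d, q, C = 8, 16, 3                  # input, state and output sizes (assumed)
phi = nn.RNNCell(d, q)              # one choice of phi(x_t, h_{t-1}; theta)
psi = nn.Linear(q, C)               # psi(h_t; theta)

x = torch.randn(20, 1, d)           # a sequence of T = 20 inputs (batch of 1)
h = torch.zeros(1, q)               # initial recurrent state h_0
for x_t in x:
    h = phi(x_t, h)                 # h_t = phi(x_t, h_{t-1})
    y = psi(h)                      # a prediction is available at every step t
```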
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
count: false
class: middle

count: false
class: middle

count: false
class: middle
class: middle
Even though the number of steps $T(\mathbf{x})$ depends on the input $\mathbf{x}$, this is a standard computational graph and automatic differentiation can deal with it as usual.
In the case of recurrent neural networks, this is referred to as backpropagation through time.
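A minimal illustration (toy sizes, assumed for the example): unrolling the recurrence in a plain Python loop and calling `backward()` on the final loss differentiates through every time step, which is backpropagation through time.

```python
import torch
import torch.nn as nn

phi, psi = nn.RNNCell(8, 16), nn.Linear(16, 1)
x, target = torch.randn(20, 1, 8), torch.randn(1, 1)

h = torch.zeros(1, 16)
for x_t in x:                  # the unrolled graph grows with the sequence
    h = phi(x_t, h)
loss = (psi(h) - target).pow(2).mean()
loss.backward()                # backpropagation through time, via autograd
```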
class: middle
Elman networks consist of a recurrent state $\mathbf{h}_t$ and an output $y_t$ computed as
$$
\begin{aligned}
\mathbf{h}_t &= \sigma_h\left( \mathbf{W}^T_{xh} \mathbf{x}_t + \mathbf{W}^T_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h \right) \\
y_t &= \sigma_y\left( \mathbf{W}^T_{y} \mathbf{h}_t + \mathbf{b}_y \right),
\end{aligned}
$$
where $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$, $\mathbf{W}_{y}$, $\mathbf{b}_h$ and $\mathbf{b}_y$ are the parameters, and $\sigma_h$ and $\sigma_y$ are activation functions.
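A direct transcription of these equations (assumed sizes, with $\sigma_h = \tanh$ and $\sigma_y$ the sigmoid):

```python
import torch

d, q = 8, 16                                             # assumed sizes
Wxh, Whh, bh = torch.randn(d, q), torch.randn(q, q), torch.zeros(q)
Wy, by = torch.randn(q, 1), torch.zeros(1)

def elman_step(x_t, h_prev):
    h_t = torch.tanh(x_t @ Wxh + h_prev @ Whh + bh)      # sigma_h = tanh
    y_t = torch.sigmoid(h_t @ Wy + by)                   # sigma_y = sigmoid
    return h_t, y_t
```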
class: middle
Learn to recognize variable-length sequences that are palindromes.
For training, we will use sequences of random sizes.
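A possible data generator for this task (hypothetical choices: binary symbols and a `max_len` parameter; not necessarily the exact setup used in the lecture):

```python
import torch

def make_example(max_len=10, n_symbols=2):
    # Hypothetical generator: half of the examples are forced to be palindromes.
    T = torch.randint(1, max_len + 1, (1,)).item()
    x = torch.randint(0, n_symbols, (T,))
    if torch.rand(1) < 0.5:
        x = torch.cat([x, x.flip(0)])               # mirror it into a palindrome
        return x, 1
    return x, int(torch.equal(x, x.flip(0)))        # label: palindrome or not
```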
class: middle
Recurrent networks can be viewed as layers producing sequences $\mathbf{h}_{1:T}$ of activations.
As for dense layers, recurrent layers can be composed in series to form a .bold[stack] of recurrent networks.
class: middle
Computing the recurrent states forward in time does not make use of future input values $\mathbf{x}_{t+1:T}$, even though they are known.
- RNNs can be made bidirectional by consuming the sequence in both directions.
- Effectively, this amounts to running the same (single-direction) RNN twice:
  - once over the original sequence $\mathbf{x}_{1:T}$,
  - once over the reversed sequence $\mathbf{x}_{T:1}$.
- The resulting recurrent states of the bidirectional RNN are the concatenation of the two resulting sequences of recurrent states.
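For instance (PyTorch sketch, arbitrary sizes), a stacked bidirectional RNN returns recurrent states of size $2q$ at each step, the concatenation of the forward and backward states:

```python
import torch
import torch.nn as nn

# A stack of 2 recurrent layers, each run in both directions.
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2,
             bidirectional=True, batch_first=True)

x = torch.randn(4, 20, 8)            # batch of 4 sequences of length 20
h, _ = rnn(x)
print(h.shape)                       # torch.Size([4, 20, 32]) = forward + backward
```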
class: middle
When unfolded through time, the graph of computation of a recurrent network can grow very deep, and training involves dealing with vanishing gradients.
- RNN cells should include a pass-through, or additive paths, so that the recurrent state does not go repeatedly through a squashing non-linearity.
- This is similar in spirit to the skip connections of ResNets.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
For instance, the recurrent state update can be a per-component weighted average of its previous value $\mathbf{h}_{t-1}$ and a full update $\bar{\mathbf{h}}_t$, with the weighting $\mathbf{z}_t$ depending on the input and the recurrent state, acting as a forget gate.
Formally, $$ \begin{aligned} \bar{\mathbf{h}}_t &= \phi(\mathbf{x}_t, \mathbf{h}_{t-1};\theta) \\ \mathbf{z}_t &= f(\mathbf{x}_t, \mathbf{h}_{t-1};\theta) \\ \mathbf{h}_t &= \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1-\mathbf{z}_t) \odot \bar{\mathbf{h}}_t. \end{aligned} $$
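A minimal sketch of such a gated cell (assumed sizes; $\phi$ and $f$ implemented as single linear layers):

```python
import torch
import torch.nn as nn

d, q = 8, 16
phi = nn.Linear(d + q, q)                  # full update \bar{h}_t
f   = nn.Linear(d + q, q)                  # gate z_t

def gated_step(x_t, h_prev):
    xh = torch.cat([x_t, h_prev], dim=-1)
    h_bar = torch.tanh(phi(xh))
    z = torch.sigmoid(f(xh))
    return z * h_prev + (1 - z) * h_bar    # per-component weighted average
```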
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
class: middle
The long short-term memory model (LSTM; Hochreiter and Schmidhuber, 1997) is an instance of the previous gated recurrent cell, with the following changes:
- The recurrent state is split into two parts $\mathbf{c}_t$ and $\mathbf{h}_t$, where
  - $\mathbf{c}_t$ is the cell state and
  - $\mathbf{h}_t$ is the output state.
- A forget gate $\mathbf{f}_t$ selects the cell state information to erase.
- An input gate $\mathbf{i}_t$ selects the cell state information to update.
- An output gate $\mathbf{o}_t$ selects the cell state information to output.
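A minimal from-scratch sketch of one LSTM step (assumed sizes; the four linear maps are fused into a single layer for brevity):

```python
import torch
import torch.nn as nn

d, q = 8, 16
gates = nn.Linear(d + q, 4 * q)            # f_t, i_t, o_t and the candidate update

def lstm_step(x_t, h_prev, c_prev):
    f, i, o, g = gates(torch.cat([x_t, h_prev], dim=-1)).chunk(4, dim=-1)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    c_t = f * c_prev + i * torch.tanh(g)   # erase / update the cell state
    h_t = o * torch.tanh(c_t)              # expose part of it as the output state
    return h_t, c_t
```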
class: middle
class: middle
The gated recurrent unit (GRU; Cho et al, 2014) is another gated recurrent cell.
- It uses two gates instead of three: an update gate $\mathbf{z}_t$ and a reset gate $\mathbf{r}_t$.
- GRUs perform similarly to LSTMs for language or speech modeling sequences, but with fewer parameters.
- However, LSTMs remain strictly stronger than GRUs.
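A similar sketch of one GRU step (assumed sizes, using the same weighted-average convention as the generic gated cell above):

```python
import torch
import torch.nn as nn

d, q = 8, 16
zr = nn.Linear(d + q, 2 * q)               # update gate z_t and reset gate r_t
hc = nn.Linear(d + q, q)                   # candidate state \bar{h}_t

def gru_step(x_t, h_prev):
    z, r = torch.sigmoid(zr(torch.cat([x_t, h_prev], -1))).chunk(2, -1)
    h_bar = torch.tanh(hc(torch.cat([x_t, r * h_prev], -1)))
    return z * h_prev + (1 - z) * h_bar
```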
class: middle
class: middle
class: middle
.center[The models do not generalize to sequences longer than those in the training set!]
Gated units prevent gradients from vanishing, but not from exploding.
.footnote[Credits: pat-coady.]
class: middle
The standard strategy to solve this issue is gradient norm clipping, which rescales the norm of the gradient to a fixed threshold $\delta$ whenever it exceeds it:
$$\tilde{\nabla} f = \frac{\nabla f}{||\nabla f||} \min(||\nabla f||, \delta).$$
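In PyTorch, this is typically done with `clip_grad_norm_` right after the backward pass (toy model and threshold below, for illustration only):

```python
import torch
import torch.nn as nn

model, delta = nn.Linear(8, 1), 1.0             # assumed model and threshold

loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
# Rescale the gradient so that its global norm never exceeds delta.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=delta)
```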
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
Let us consider a simplified RNN, with no inputs ($\mathbf{x}_t = 0$), no bias ($\mathbf{b}_h = 0$), an identity activation function $\sigma$, and with $\mathbf{W}_{hh}$ written simply as $\mathbf{W}$.
We have, $$ \begin{aligned} \mathbf{h}_t &= \sigma\left( \mathbf{W}^T_{xh} \mathbf{x}_t + \mathbf{W}^T_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h \right) \\ &= \mathbf{W}^T_{hh} \mathbf{h}_{t-1} \\ &= \mathbf{W}^T \mathbf{h}_{t-1}. \end{aligned} $$
For a sequence of size $n$, we have
$$\mathbf{h}_n = \left(\mathbf{W}^n\right)^T \mathbf{h}_0.$$
Ideally, we would like $||\mathbf{h}_n||$ to neither vanish nor explode as $n$ increases.
class: middle
The Fibonacci sequence is defined by the recurrence $a_n = a_{n-1} + a_{n-2}$, with $a_0 = 0$ and $a_1 = 1$:
$$0, 1, 1, 2, 3, 5, 8, 13, 21, \ldots$$
class: middle
In matrix form, the Fibonacci sequence is equivalently expressed as
$$\begin{pmatrix} a_{n+1} \\ a_n \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a_n \\ a_{n-1} \end{pmatrix},$$
that is $\mathbf{f}_n = \mathbf{A} \mathbf{f}_{n-1}$ with $\mathbf{f}_n = \begin{pmatrix} a_{n+1} \\ a_n \end{pmatrix}$ and $\mathbf{A} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$.
With $\mathbf{f}_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, we have $\mathbf{f}_n = \mathbf{A} \mathbf{f}_{n-1} = \mathbf{A}^n \mathbf{f}_0$.
class: middle
The matrix $\mathbf{A}$ can be diagonalized as
$$\mathbf{A} = \mathbf{S} \Lambda \mathbf{S}^{-1}, \quad \Lambda = \begin{pmatrix} \varphi & 0 \\ 0 & -\varphi^{-1} \end{pmatrix},$$
where $\varphi = \frac{1+\sqrt{5}}{2}$ is the golden ratio.

In particular, $\mathbf{A}^n = \mathbf{S} \Lambda^n \mathbf{S}^{-1}$.

Therefore, the Fibonacci sequence grows exponentially fast with the golden ratio $\varphi$.
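A quick numerical check (NumPy):

```python
import numpy as np

A = np.array([[1, 1], [1, 0]])
f0 = np.array([1, 0])                      # f_0 = (a_1, a_0)
phi = (1 + np.sqrt(5)) / 2

f = np.linalg.matrix_power(A, 20) @ f0     # f_20 = (a_21, a_20)
print(f[1], int(round(phi**20 / np.sqrt(5))))   # 6765 6765: a_n grows like phi^n
```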
class: middle
Let $\rho(\mathbf{A}) = \max_i |\lambda_i|$ denote the spectral radius of $\mathbf{A}$, where the $\lambda_i$ are its eigenvalues.

We have:
- if $\rho(\mathbf{A}) < 1$, then $\lim_{n\to\infty} ||\mathbf{A}^n|| = 0$ (= vanishing activations),
- if $\rho(\mathbf{A}) > 1$, then $\lim_{n\to\infty} ||\mathbf{A}^n|| = \infty$ (= exploding activations).
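A quick numerical illustration (NumPy, with a random $16 \times 16$ matrix rescaled to a chosen spectral radius):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
h = rng.normal(size=16)

for gain in [0.9, 1.1]:
    A = gain * W / max(abs(np.linalg.eigvals(W)))   # spectral radius = gain
    norms = [np.linalg.norm(np.linalg.matrix_power(A, n) @ h) for n in (1, 10, 100)]
    print(gain, norms)   # vanishes for rho < 1, explodes for rho > 1
```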
class: middle
.footnote[Credits: Stephen Merety, Explaining and illustrating orthogonal initialization for recurrent neural networks, 2016.]
class: middle
.footnote[Credits: Stephen Merety, Explaining and illustrating orthogonal initialization for recurrent neural networks, 2016.]
class: middle
If $\mathbf{W}$ is an orthogonal matrix, i.e. $\mathbf{W}^T \mathbf{W} = \mathbf{I}$, then it is norm-preserving ($||\mathbf{W}\mathbf{h}|| = ||\mathbf{h}||$) and all of its eigenvalues have modulus 1.
- Therefore, initializing $\mathbf{W}$ as a random orthogonal matrix will guarantee that activations neither vanish nor explode.
- In practice, a random orthogonal matrix can be obtained from the SVD or the QR factorization of a random matrix.
- This initialization strategy is known as orthogonal initialization.
class: middle
In TensorFlow's `Orthogonal` initializer:
```python
# Generate a random matrix
a = random_ops.random_normal(flat_shape, dtype=dtype, seed=self.seed)
# Compute the qr factorization
q, r = gen_linalg_ops.qr(a, full_matrices=False)
# Make Q uniform
d = array_ops.diag_part(r)
q *= math_ops.sign(d)
if num_rows < num_cols:
  q = array_ops.matrix_transpose(q)
return self.gain * array_ops.reshape(q, shape)
```
.footnote[Credits: Tensorflow, tensorflow/python/ops/init_ops.py.]
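The same recipe in plain NumPy (an equivalent sketch, not TensorFlow's actual code):

```python
import numpy as np

def orthogonal(shape, gain=1.0, rng=np.random.default_rng()):
    # QR of a random Gaussian matrix, with a sign correction so that Q is
    # drawn uniformly from the orthogonal group.
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))
    return gain * q

W = orthogonal((16, 16))
print(np.allclose(W.T @ W, np.eye(16)))   # True
```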
class: middle
.footnote[Credits: Stephen Merety, Explaining and illustrating orthogonal initialization for recurrent neural networks, 2016.]
class: middle
Exploding activations are also the reason why squashing non-linearity functions (such as $\tanh$ or the sigmoid) are used in recurrent units.
- They prevent the recurrent states from exploding by upper bounding $||\mathbf{h}_t||$.
- (At least when running the network forward.)
???
https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
class: middle
(some) applications
class: middle
Document-level modeling for sentiment analysis (= text classification),
with stacked, bidirectional and gated recurrent networks.
.footnote[Credits: Duyu Tang et al, Document Modeling with Gated Recurrent Neural Network for Sentiment Classification, 2015.]
class: middle
Model language as a Markov chain, such that sentences are sequences of words $\mathbf{w}_{1:T}$ drawn repeatedly from $p(\mathbf{w}_t | \mathbf{w}_{1:t-1})$.
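A minimal sketch of such a generator (hypothetical character-level model; the vocabulary, embedding and state sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical character-level language model: embed, recurrent update, softmax
# over the next token; sequences are generated one token at a time.
vocab, q = 128, 256
embed, rnn, head = nn.Embedding(vocab, 64), nn.GRUCell(64, q), nn.Linear(q, vocab)

def sample(first_token, length=50):
    h, t, out = torch.zeros(1, q), torch.tensor([first_token]), []
    for _ in range(length):
        h = rnn(embed(t), h)
        probs = torch.softmax(head(h), dim=-1)
        t = torch.multinomial(probs, 1).squeeze(1)   # w_t ~ p(w_t | w_{1:t-1})
        out.append(t.item())
    return out
```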
class: middle
.footnote[Credits: Alex Graves, Generating Sequences With Recurrent Neural Networks, 2013.]
class: middle
.footnote[Credits: Max Woolf, 2018.]
class: middle
The same generative architecture applies to any kind of sequences.
E.g., sketch-rnn-demo, where sketches are defined as sequences of pen strokes.
class: middle
.footnote[Credits: Yonghui Wu et al, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.]
class: middle
.footnote[Credits: Yonghui Wu et al, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.]
class: middle
.footnote[Image credits: Shen et al, 2017. arXiv:1712.05884.]
class: middle, black-slide
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/Ipi40cb_RsI?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>
]

A recurrent network playing Mario Kart.
class: middle
class: middle
.italic[ An increasingly large number of .bold[people are defining the networks procedurally in a data-dependent way (with loops and conditionals)], allowing them to change dynamically as a function of the input data fed to them. It's really .bold[very much like a regular program, except it's parameterized]. ]
.pull-right[Yann LeCun (Director of AI Research, Facebook, 2018)]
.center[Any Turing machine can be simulated by a recurrent neural network
(Siegelmann and Sontag, 1995)]
???
This implies that usual programs can all be equivalently implemented as a neural network, in theory.
class: middle
Networks can be coupled with memory storage to produce neural computers:
- The controller processes the input sequence and interacts with the memory to generate the output.
- The read and write operations attend to all the memory addresses.
???
- A Turing machine is not very practical to implement useful programs.
- However, we can simulate the general architecture of a computer in a neural network.
class: middle
A differentiable neural computer being trained to store and recall dense binary numbers. Upper left: the input (red) and target (blue), as 5-bit words and a 1-bit interrupt signal. Upper right: the model's output.
The topology of a recurrent network unrolled through time is dynamic.
It depends on:
- the input sequence and its size
- a graph construction algorithm which consumes input tokens in sequence to add layers to the graph of computation.
This principle generalizes to:
- arbitrarily structured data (e.g., sequences, trees, graphs)
- arbitrary graph construction algorithms that traverse these structures (e.g., including for-loops or recursive calls).
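For instance (a toy sketch, not from the lecture), a recursive Python function can build a differentiable computation over a tree whose topology depends on the input:

```python
import torch
import torch.nn as nn

# The computation graph is built by a program that traverses the input
# structure: here, a recursive reduction over a nested tuple (a binary tree).
leaf, combine = nn.Linear(1, 8), nn.Linear(16, 8)

def embed(node):
    if isinstance(node, (int, float)):                  # leaf
        return torch.tanh(leaf(torch.tensor([[float(node)]])))
    left, right = embed(node[0]), embed(node[1])        # internal node
    return torch.tanh(combine(torch.cat([left, right], dim=-1)))

tree = (1.0, ((2.0, 3.0), 4.0))
embed(tree).sum().backward()     # differentiable despite the dynamic topology
```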
class: middle
Even though the graph topology is dynamic, the unrolled computation is fully differentiable. The program is trainable.
.footnote[Credits: Henrion et al, 2017.]
class: middle
.footnote[Credits: Shi and Rajkumar, Point-GNN, 2020.]
class: middle
.footnote[Credits: Schutt et al, 2017.]
???
quantum-mechanical properties of molecular systems
class: middle
.footnote[Credits: Sanchez-Gonzalez et al, 2020.]
class: middle, black-slide
.center[
.footnote[Credits: Sanchez-Gonzalez et al, 2020.]
class: end-slide, center
count: false
The end.
count: false
- Kyunghyun Cho, "Natural Language Understanding with Distributed Representation", 2015.