
class: middle, center, title-slide

Deep Learning

Lecture 11: Theory of deep learning



Prof. Gilles Louppe
[email protected]

???

R: move out the GP part into a new lecture.
R: cover neural tangents there https://rajatvd.github.io/NTK/
R: science of dl https://people.csail.mit.edu/madry/6.883/

mysteries of deep learning -> better generalization than they should (over-param) -> lottery ticket -> adversarial examples http://introtodeeplearning.com/materials/2019_6S191_L6.pdf

R: check generalization from https://m2dsupsdlclass.github.io/lectures-labs/slides/08_expressivity_optimization_generalization/index.html#87


Universal approximation

.bold[Theorem.] (Cybenko 1989; Hornik et al, 1991) Let $\sigma(\cdot)$ be a bounded, non-constant continuous function. Let $I_p$ denote the $p$-dimensional hypercube, and $C(I_p)$ denote the space of continuous functions on $I_p$. Given any $f \in C(I_p)$ and $\epsilon > 0$, there exist an integer $q > 0$ and parameters $v_i, b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^p$, $i=1, \dots, q$, such that $$F(x) = \sum_{i \leq q} v_i \sigma(w_i^T x + b_i)$$ satisfies $$\sup_{x \in I_p} |f(x) - F(x)| < \epsilon.$$
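As a purely illustrative reading of the statement, the sketch below evaluates the approximant $F$ for a sigmoid $\sigma$ with randomly chosen parameters; the theorem only asserts that *some* parameter setting achieves the $\epsilon$-approximation, it does not say how to find it. The dimensions and the random parameters are placeholders, not part of the theorem.

```python
import numpy as np

def F(x, v, W, b, sigma=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """One-hidden-layer approximant F(x) = sum_i v_i * sigma(w_i^T x + b_i)."""
    return np.dot(v, sigma(W @ x + b))

# Random parameters, for illustration only: the theorem asserts that for any
# continuous f on the hypercube and any epsilon, *some* choice of (q, v, W, b)
# achieves sup-norm error below epsilon; it says nothing about how to find it.
p, q = 3, 16                      # placeholder dimensions
rng = np.random.default_rng(0)
v = rng.normal(size=q)
W = rng.normal(size=(q, p))
b = rng.normal(size=q)
print(F(np.array([0.2, 0.5, 0.8]), v, W, b))
```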


class: middle

The universal approximation theorem

  • guarantees that even a single-hidden-layer network can represent any classification problem in which the boundary is locally linear (smooth);
  • says nothing about which architectures are good or bad, nor about how they relate to the optimization procedure;
  • generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno et al, 1993).

class: middle

.bold[Theorem] (Barron, 1992) The mean integrated square error between the estimated network $\hat{F}$ and the target function $f$ is bounded by $$O\left(\frac{C^2_f}{q} + \frac{qp}{N}\log N\right)$$ where $N$ is the number of training points, $q$ is the number of neurons, $p$ is the input dimension, and $C_f$ measures the global smoothness of $f$.

  • Combines approximation and estimation errors.
  • Provided enough data, it guarantees that adding more neurons will result in a better approximation.
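A hedged numerical reading of the bound: treating the $O(\cdot)$ as an equality with unit constants (an assumption, since the true constants are hidden), one can look at how the minimizing number of neurons $q$ shifts as $N$ grows. The values of $C_f$ and $p$ below are arbitrary placeholders.

```python
import numpy as np

# Qualitative reading of the bound O(C_f^2 / q + q p / N * log N): the first
# (approximation) term shrinks with the number of neurons q while the second
# (estimation) term grows with it, so for a fixed dataset size N there is a
# sweet spot in q, and it shifts to larger q as N grows.
# C_f, p and the implicit constants (taken to be 1) are placeholders.
C_f, p = 10.0, 5

def barron_bound(q, N):
    return C_f**2 / q + q * p / N * np.log(N)

qs = np.arange(1, 20_000)
for N in (10**3, 10**5, 10**7):
    q_star = qs[np.argmin(barron_bound(qs, N))]
    print(f"N = {N:>8d}  ->  bound minimized around q ~ {q_star}")
```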

class: middle

Let us consider the 1-layer MLP $$f(x) = \sum w_i \text{ReLU}(x + b_i).$$
This model can approximate any smooth 1D function, provided enough hidden units.

.center[(figure: a sum of ReLU units approximating a smooth 1D function)]
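As an illustrative companion (not part of the original slides), the sketch below fits this model to an arbitrary smooth target by placing the kinks $-b_i$ on a grid and solving for the output weights $w_i$ by least squares; the target function, the grid, and the fitting procedure are all assumptions made for the example.

```python
import numpy as np

# Minimal sketch (assumptions: target, grid of biases, least-squares fit) of
#   f(x) = sum_i w_i * ReLU(x + b_i)
# approximating a smooth 1D target. The kinks -b_i are fixed on a grid over
# [0, 1), so only the output weights w_i are learned, by linear least squares.
relu = lambda z: np.maximum(z, 0.0)
target = lambda x: np.sin(2 * np.pi * x)            # arbitrary smooth target
x = np.linspace(0.0, 1.0, 500)

for q in (4, 16, 64):
    b = -np.linspace(0.0, 1.0, q, endpoint=False)   # kinks spread over [0, 1)
    Phi = relu(x[:, None] + b[None, :])             # (500, q) feature matrix
    w, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
    sup_err = np.max(np.abs(Phi @ w - target(x)))
    print(f"q = {q:3d}   sup error ~ {sup_err:.4f}")
```

As expected from the theorem, the uniform error shrinks as the number of hidden units $q$ grows.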



Effect of depth

.center.width-80[(figure: linear regions computed by a rectifier network)]

.bold[Theorem] (Montúfar et al, 2014) A rectifier neural network with $p$ input units and $L$ hidden layers of width $q \geq p$ can compute functions that have $\Omega((\frac{q}{p})^{(L-1)p} q^p)$ linear regions.

  • That is, the number of linear regions of deep models grows exponentially in $L$ and polynomially in $q$.
  • Even for small values of $L$ and $q$, deep rectifier models are able to produce substantially more linear regions than shallow rectifier models.
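To make the statement more tangible, here is a rough empirical sketch (my own construction, with arbitrary widths and random Gaussian weights): it counts the distinct ReLU activation patterns encountered along a random line through input space, which lower-bounds the number of linear regions the network cuts that line into. Note that the theorem bounds what such networks *can* compute; randomly initialized networks need not attain it.

```python
import numpy as np

# Rough empirical sketch: count distinct ReLU activation patterns met along a
# random line through input space. Each pattern corresponds to one linear
# region crossed, so this lower-bounds the network's total number of regions.
# Widths, depth and the Gaussian initialization are arbitrary choices; the
# theorem is about the maximal number of regions such networks *can* realize,
# not about what random weights typically produce.
rng = np.random.default_rng(0)

def regions_along_line(p, widths, n_points=100_000):
    # Random Gaussian layers (W, b), fan-in scaled.
    layers, d_in = [], p
    for q in widths:
        layers.append((rng.normal(size=(q, d_in)) / np.sqrt(d_in),
                       rng.normal(size=q)))
        d_in = q
    # Points c + t * a on a random line, for t in [-3, 3].
    a, c = rng.normal(size=p), rng.normal(size=p)
    t = np.linspace(-3.0, 3.0, n_points)
    H = c[None, :] + t[:, None] * a[None, :]
    patterns = []
    for W, b in layers:
        Z = H @ W.T + b
        patterns.append(Z > 0)          # which units are active at each point
        H = np.maximum(Z, 0.0)
    codes = np.concatenate(patterns, axis=1)
    return len(np.unique(codes, axis=0))

p = 4
print("shallow, 1 layer  x 24 units:", regions_along_line(p, [24]))
print("deep,    3 layers x  8 units:", regions_along_line(p, [8, 8, 8]))
```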