class: middle, center, title-slide
Lecture 3: Neural Networks
Linear regression problem: $y = Wx + b$
Use mean squared error to judge how good the predictions are and adjust $W$ and $b$ to reduce it.
Predict binary outcome (zero or one) from input variable(s).
Use a linear classifier.
What we’d really like is class probabilities.
Normalise elements of
Linear classifier (or logistic regression): $$ \mathbf{y} = \mathsf{softmax} ( \mathbf{W} x + \mathbf{b} ) $$
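A minimal sketch of the softmax normalisation in plain Python. The function name `softmax` and the example scores are illustrative assumptions, not taken from the lecture notebooks. Subtracting the largest score before exponentiating leaves the result unchanged but avoids overflow.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Shift by the maximum for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for three classes (made-up numbers).
probs = softmax([2.0, 1.0, 0.1])
```

The largest score receives the largest probability, and the outputs always sum to one.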
We have:
- a function $f(x; W, b)$ that we can use to make predictions $\hat{y}$
  - $f(x; W, b) = \mathsf{softmax}(\mathbf{W} x + \mathbf{b})$
- a loss function $L(\hat{y}, y)$ to measure how well our model is doing
  - log-loss or cross-entropy: $-\left[ y\log p + (1-y) \log(1-p) \right]$
- find optimal values of $\mathbf{W}$ and $\mathbf{b}$ by gradient descent
  - compute gradient of $L$ w.r.t. each parameter: $\frac{dL}{dW}$ and $\frac{dL}{db}$
  - update parameters: $W \leftarrow W - \alpha \frac{dL}{dW}$ and $b \leftarrow b - \alpha \frac{dL}{db}$
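The three ingredients (model, loss, gradient descent) can be sketched for the simplest case: one input variable and a binary outcome, where softmax reduces to the sigmoid. This is an illustrative sketch, not the code from the notebooks; the data, the learning rate `alpha=0.5` and the function names are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Mean cross-entropy of p = sigmoid(w * x + b) against labels ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def gradient_step(w, b, xs, ys, alpha):
    """One gradient-descent update: move against the gradient."""
    n = len(xs)
    # For the log-loss with a sigmoid: dL/dw = mean((p - y) * x), dL/db = mean(p - y).
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
    return w - alpha * dw, b - alpha * db

# Made-up training data: negative inputs are class 0, positive inputs class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0
loss_before = log_loss(w, b, xs, ys)
for _ in range(100):
    w, b = gradient_step(w, b, xs, ys, alpha=0.5)
loss_after = log_loss(w, b, xs, ys)
```

With both parameters at zero every prediction is 0.5, so the starting loss is $\log 2$; each update with the minus sign lowers it.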
- linear-regression.ipynb
- logistic-regression.ipynb
A (very) complicated mathematical function. That is it.
Takes in a collection of numbers (pixel intensities) and outputs numbers (class probabilities).
.larger.center[
f()
Neural networks can have several layers.
- final layer performs logistic regression
- all previous layers transform the data from one representation to another.
Blackboard!
--
.footnote[From https://documents.epfl.ch/users/f/fl/fleuret/www/dlc/]
Use fixed feature extractor and tune last step using supervised learning.
Adjust parameters of every step using supervised learning.
--
Ideal for problems where you do not know a good representation of the data.
Recent resurgence of neural networks thanks to deep learning. Isn’t it all just hype?
- Often just "old algorithms".
- GPUs! Vastly more computing power available today.
- New ideas and understanding.
- More data with labels.
Key point: feed raw features to algorithm, learn everything else.
neural-networks-as-feature-extractors.ipynb
Standard Dense Layer for an image input:
x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
x = Flatten()(x)
# shape of x is: (None, 640 x 480 x 3)
z = Dense(1000)(x)
How many parameters in the Dense layer?
--
Spatial organization of the input is destroyed by Flatten
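The parameter count of the Dense layer can be worked out by hand: every unit has one weight per flattened input value plus one bias. A short sketch of the arithmetic (variable names are illustrative):

```python
# Parameter count of Dense(1000) applied to a flattened 640 x 480 x 3 image.
height, width, channels = 640, 480, 3
units = 1000

n_inputs = height * width * channels   # length of the flattened input vector
n_weights = n_inputs * units           # one weight per (input, unit) pair
n_biases = units                       # one bias per unit
n_params = n_weights + n_biases

print(f"{n_params:,}")  # prints 921,601,000
```

Close to a billion parameters for a single layer, which is why fully connected layers scale poorly to images.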
The solution is convolutional layers.
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
.footnote[ LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. ]
- slide a small ($3 \times 3$) window over the image ($5 \times 5$)
- several filters (neurons) per convolutional layer
- each output neuron is parametrised with a $3 \times 3$ weight matrix $\mathbf{w}$
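The sliding window can be sketched in plain Python for a single filter on a single-channel, square image. This is a sketch under assumptions: no padding, stride 1, and relu as the activation (as in the Keras examples in this lecture). The function name `conv2d_valid` and the input values are made up for illustration.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over a square `image` (no padding, stride 1), then apply relu."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Weighted sum of the k x k chunk of the image under the window.
            z = bias
            for di in range(k):
                for dj in range(k):
                    z += kernel[di][dj] * image[i + di][j + dj]
            row.append(max(0.0, z))  # relu activation
        output.append(row)
    return output

# A 5 x 5 image of ones and a 3 x 3 kernel of ones (illustrative values).
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]
feature_map = conv2d_valid(image, kernel)
```

The same nine weights are reused at every position, and the $5 \times 5$ input yields a $3 \times 3$ feature map.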
.footnote[ These slides use convolution visualisations by V. Dumoulin available at https://github.com/vdumoulin/conv_arithmetic]
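The activation is obtained by sliding the $3 \times 3$ window over the image and computing $z(x) = \mathsf{relu}(\mathbf{w}^T x + b)$ at each step, where $x$ is the $3 \times 3$ chunk of the image under the window.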
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
Local connectivity:
- Output depends only on a few local inputs
- Translational invariance
Compared to Fully connected/Dense:
- Parameter sharing, reduced number of parameters
- Make use of spatial structure: a good assumption for vision!
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
input_image = Input(shape=(28, 28, 3))
*x = Conv2D(32, 5, activation='relu')(input_image)
*x = MaxPool2D(2, strides=2)(x)
*x = Conv2D(64, 3, activation='relu')(x)
*x = MaxPool2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)
Two layers of convolution and pooling implemented using Keras.
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
image-convolutions-with-keras.ipynb
Coloured image = tensor of shape (height, width, channels)
Convolutions are usually computed for each channel and summed:
.center.width-50[
]
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
input_image = Input(shape=(28, 28, 3))
x = Conv2D(4, 5, activation='relu')(input_image)
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
- Strides: increment step size for the convolution operator
- Reduces the size of the output map
.center.small[
Example with kernel size
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
- Padding: artificially fill borders of image
- Useful to keep spatial dimension constant across filters
- Useful with strides and large receptive fields
- Usually: fill with 0s
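How strides and padding change the size of the output map can be summarised by the standard formula $\lfloor (W + 2P - F) / S \rfloor + 1$ along each spatial dimension. A small sketch follows; the function name `conv_output_size` and the example sizes are illustrative assumptions.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_padding = conv_output_size(5, 3)                     # 3
same_padding = conv_output_size(5, 3, padding=1)        # 5
strided = conv_output_size(5, 3, stride=2, padding=1)   # 3
```

Padding by one pixel keeps a $5 \times 5$ input at $5 \times 5$ for a $3 \times 3$ kernel, while a stride of 2 roughly halves it.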
.footnote[From https://github.com/m2dsupsdlclass/lectures-labs]
Kernel or Filter shape
.left-column[
- $F \times F$ kernel size,
- $C^i$ input channels
- $C^o$ output channels
]
.right-column[
.width-40.center[]
]
--
.reset-column[ ]
Number of parameters:
--
Activations or Feature maps shape:
- Input $(W^i, H^i, C^i)$
- Output $(W^o, H^o, C^o)$
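Given the kernel shape, the number of trainable parameters of a convolution layer is $F \times F \times C^i \times C^o$ weights plus, if biases are used, $C^o$ biases. A short sketch of that count; the helper `conv_params` is a hypothetical name introduced for this example.

```python
def conv_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Trainable parameters of a 2D convolution with a square kernel."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels if use_bias else 0
    return weights + biases

# Four 5 x 5 filters applied to a 3-channel image, as in Conv2D(4, 5).
n = conv_params(kernel_size=5, in_channels=3, out_channels=4)   # 304
```

The count does not depend on the width or height of the input, only on the kernel size and channel counts.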
???
The filters hold the trainable parameters of the model (excluding the biases).
The feature maps are the outputs (or activations) of convolution layers when applied to a specific batch of images.
- Spatial dimension reduction
- Local invariance
- No parameters: typically maximum or average of 2x2 units
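A minimal sketch of $2 \times 2$ max pooling with stride 2 in plain Python, assuming a feature map with even height and width. The function name and the numbers are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 on a 2D list with even height and width."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = [
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)  # [[4, 2], [2, 8]]
```

Each spatial dimension is halved and there is nothing to learn: the operation only picks the largest value in each window.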
.footnote[Schematic from Stanford http://cs231n.github.io/convolutional-networks]
- Dropout - a good way to regularise your network
- Batch Norm - normalise the activations at each layer of the network
.footnote[ LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. ]
- Input
--
- Conv blocks
  - Convolution + activation (relu)
  - Convolution + activation (relu)
  - ...
  - Maxpooling 2x2
(repeat these a few times)
--
- Output
  - Fully connected layers
  - Softmax
.footnote[Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)]
.smaller[
# VGG-16 style stack written for the Keras 2 API (channels-last input);
# zero-padding layers are left out to keep the listing short.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Classifier head: two fully connected layers with dropout, then softmax over 1000 classes.
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
]
.left-column[
]
.footnote[ .left-column[ He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016. ] ]
.right-column[
Even deeper models:
34, 50, 101, 152 layers
.left-column[
]
.footnote[ .left-column[ He, Kaiming, et al. "Deep residual learning for image recognition." CVPR. 2016. ] ]
.right-column[
.left-column[ A block learns the residual with respect to identity:
- Good optimization properties ]
.footnote[ from Kaiming He slides "Deep residual learning for image recognition." ICML. 2016. ]
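The idea of learning a residual with respect to the identity can be written as $y = F(x) + x$: the block's input is added back to the output of its layers. A minimal sketch in plain Python, where `residual_fn` stands in for the learned layers $F$ and `zero_residual` is a hypothetical example in which those layers output zeros:

```python
def residual_block(x, residual_fn):
    """Return x + F(x), where F = residual_fn is the learned residual."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]

def zero_residual(x):
    # Hypothetical residual function whose weights are all zero.
    return [0.0 for _ in x]

x = [0.5, -1.0, 2.0]
y = residual_block(x, zero_residual)
```

When the residual is zero the block passes its input through unchanged, so adding more blocks does not have to make the network worse than a shallower one.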
Deep networks require millions of images and days or weeks of GPU time to train. We usually have neither. What to do?
- Treat a whole network as a "feature transformer"
- Use the last or second to last layer as input features to a logistic regression or a small neural network which is trained on our small dataset
- teachable machine demo
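The recipe above can be sketched without any deep learning library. In this sketch `frozen_features` is a hypothetical stand-in for the second-to-last layer of a pretrained network (a real one would come from a model such as VGG or ResNet), and only the small logistic regression on top is trained. The dataset, learning rate and step count are made up for illustration.

```python
import math

def frozen_features(x):
    # Hypothetical stand-in for a pretrained network's second-to-last layer;
    # it has no trainable parameters and is never updated below.
    return [max(0.0, x[0] - x[1]), max(0.0, x[1] - x[0])]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, alpha=0.5, steps=200):
    """Fit a logistic regression on precomputed features by gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(steps):
        dw = [0.0] * len(w)
        db = 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            dw = [dwi + err * fi / n for dwi, fi in zip(dw, f)]
            db += err / n
        w = [wi - alpha * dwi for wi, dwi in zip(w, dw)]
        b -= alpha * db
    return w, b

# Tiny made-up dataset: label 1 when the first input is the larger one.
inputs = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = [1, 1, 0, 0]

features = [frozen_features(x) for x in inputs]  # computed once, extractor untouched
w, b = train_head(features, labels)
predictions = [
    int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5) for f in features
]
```

Because the features are computed once and reused, training the head is cheap even when the feature extractor itself is very large.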
How good are you compared to a computer at quickly identifying cats vs dogs?
--
.footnote[Elsayed et al, https://arxiv.org/abs/1802.08195]
What is the left picture? What is the right picture?
.footnote[Moosavi-Dezfooli et al, https://arxiv.org/abs/1610.08401v1]
.footnotes[Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, http://www.evolvingai.org/fooling]
- Convnet on fashion MNIST
- transfer learning on road bike dataset
- what do you want to do as project?
Fin.