Lecture 3: Neural Networks
Linear regression problem: $ y = Wx +b $
Use mean squared error to judge how good predictions are and adjust $W$ and $b$ to improve them.
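A minimal sketch of this idea, assuming a one-dimensional input, synthetic data drawn from $y = 2x + 1$, and a hand-picked learning rate `alpha` (all of these are illustrative choices, not taken from the lecture notebooks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from y = 2x + 1 plus a little noise (made-up values).
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=100)

W, b = 0.0, 0.0   # initial parameters
alpha = 0.1       # learning rate (assumed)

def mse(W, b):
    """Mean squared error of the predictions W * x + b."""
    return np.mean((W * x + b - y) ** 2)

initial_loss = mse(W, b)
for _ in range(500):
    error = W * x + b - y
    # Gradients of the mean squared error w.r.t. W and b.
    dW = 2.0 * np.mean(error * x)
    db = 2.0 * np.mean(error)
    # Step against the gradient to reduce the error.
    W -= alpha * dW
    b -= alpha * db
final_loss = mse(W, b)

print(W, b, final_loss)
```

After training, `W` and `b` should be close to the values used to generate the data.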
Predict binary outcome (zero or one) from input variable(s).
Use a linear classifier.
What we’d really like is class probabilities.
Normalise the elements of $\mathbf{W} x + \mathbf{b}$ so that they are positive and sum to one.
Linear classifier (or logistic regression): $$ \mathbf{y} = \mathsf{softmax} ( \mathbf{W} x + \mathbf{b} ) $$
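As a rough illustration of the formula above, here is a NumPy sketch of a softmax classifier's forward pass. The weights, bias and input are made-up numbers for three classes and two features:

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative parameters: 3 classes, 2 input features.
W = np.array([[1.0, -1.0],
              [0.5, 0.5],
              [-1.0, 2.0]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([1.0, 2.0])

y = softmax(W @ x + b)   # class probabilities
print(y, y.sum())
```

Every entry of `y` is positive and the entries sum to one, so they can be read as class probabilities.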
We have:
- a function $f(x; W, b)$ that we can use to make predictions $\hat{y}$: $f(x; W, b) = \mathsf{softmax}(\mathbf{W} x + \mathbf{b})$
- a loss function $L(\hat{y}, y)$ to measure how well our model is doing: log-loss or cross-entropy, $-\left[ y\log p + (1-y) \log(1-p) \right]$
- find optimal values of $\mathbf{W}$ and $\mathbf{b}$ by gradient descent
  - compute the gradient of $L$ w.r.t. each parameter: $\frac{dL}{dW}$ and $\frac{dL}{db}$
  - update the parameters against the gradient: $W \leftarrow W - \alpha \frac{dL}{dW}$ and $b \leftarrow b - \alpha \frac{dL}{db}$
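The recipe above can be sketched in NumPy for the binary case. This assumes a sigmoid output $p$, two synthetic Gaussian blobs as data, and a learning rate chosen by hand. Each step subtracts the gradient so that the log-loss goes down:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic Gaussian blobs, one per class (made-up data).
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y):
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

W = np.zeros(2)
b = 0.0
alpha = 0.1  # learning rate (assumed)

initial_loss = log_loss(sigmoid(X @ W + b), y)
for _ in range(200):
    p = sigmoid(X @ W + b)
    # Gradients of the mean log-loss w.r.t. W and b.
    dW = X.T @ (p - y) / len(y)
    db = np.mean(p - y)
    # Move against the gradient.
    W -= alpha * dW
    b -= alpha * db

final_loss = log_loss(sigmoid(X @ W + b), y)
accuracy = np.mean((sigmoid(X @ W + b) > 0.5) == (y == 1))
print(initial_loss, final_loss, accuracy)
```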
- linear-regression.ipynb
- logistic-regression.ipynb
A (very) complicated mathematical function. That is it.
Takes in a collection of numbers (pixel intensities) and outputs numbers (class probabilities).
Neural networks can have several layers.
- final layer performs logistic regression
- all previous layers transform the data from one representation to another.
Use fixed feature extractor and tune last step using supervised learning.
Adjust parameters of every step using supervised learning.
Ideal for problems where you do not know a good representation of the data.
Recent resurgence of neural networks thanks to deep learning. Isn’t it all just hype?
- Often just "old algorithms".
- GPUs! Vastly more computing power available today.
- New ideas and understanding.
- More data with labels.
Key point: feed raw features to algorithm, learn everything else.
Standard Dense Layer for an image input:
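# assuming the Keras bundled with TensorFlow
from tensorflow.keras.layers import Input, Flatten, Dense
x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
x = Flatten()(x)
# shape of x is: (None, 640 * 480 * 3) = (None, 921600)
z = Dense(1000)(x)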
How many parameters in the Dense layer?
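One way to work out the answer, assuming the Dense layer holds one weight per (input, unit) pair plus one bias per unit:

```python
height, width, channels = 640, 480, 3
units = 1000

n_inputs = height * width * channels   # length of the flattened vector
n_weights = n_inputs * units           # one weight per (input, unit) pair
n_biases = units                       # one bias per unit
n_params = n_weights + n_biases

print(n_inputs)   # 921600
print(n_params)   # 921601000
```

That is close to a billion parameters for a single layer, which is why dense layers applied directly to raw images scale badly.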
Spatial organization of the input is destroyed by Flatten
The solution is convolutional layers.
- slide a small ($3 \times 3$) window over the image ($5 \times 5$)
- several filters (neurons) per convolutional layer
- each output neuron is parametrised with a $3 \times 3$ weight matrix $\mathbf{w}$
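The activation is obtained by sliding the $3 \times 3$ window over the image and computing $z(x) = \mathsf{relu}(\mathbf{w}^T x + b)$ at each step, where $x$ is the $3 \times 3$ patch of the image currently under the window.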
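A small NumPy sketch of the sliding window, assuming no padding, a stride of one, a relu activation, and an averaging filter chosen only for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv2d_single(image, w, b=0.0):
    """Slide the window w over image (no padding, stride 1), then apply relu."""
    kh, kw = w.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # part of the image under the window
            out[i, j] = np.sum(w * patch) + b   # weighted sum plus bias
    return relu(out)

image = np.arange(25, dtype=float).reshape(5, 5)   # 5 x 5 "image"
w = np.ones((3, 3)) / 9.0                          # 3 x 3 averaging filter
activation = conv2d_single(image, w)

print(activation.shape)   # (3, 3)
```

A $5 \times 5$ image and a $3 \times 3$ window give a $3 \times 3$ map of activations, one per window position.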
Local connectivity:
- Output depends only on a few local inputs
- Translational invariance
Compared to Fully connected/Dense:
- Parameter sharing, reduced number of parameters
- Make use of spatial structure: a good assumption for vision!
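# assuming the Keras bundled with TensorFlow
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, Flatten, Dense
from tensorflow.keras.models import Model
input_image = Input(shape=(28, 28, 3))
# two blocks of convolution followed by pooling
x = Conv2D(32, 5, activation='relu')(input_image)
x = MaxPool2D(2, strides=2)(x)
x = Conv2D(64, 3, activation='relu')(x)
x = MaxPool2D(2, strides=2)(x)
# classifier on top of the extracted features
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)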
Two layers of convolution and pooling implemented using Keras.
Coloured image = tensor of shape (height, width, channels)
Convolutions are usually computed for each channel and summed:
input_image = Input(shape=(28, 28, 3))
x = Conv2D(4, 5, activation='relu')(input_image)
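To make the "computed for each channel and summed" step concrete, here is a NumPy sketch with the same shapes as the layer above: a $28 \times 28 \times 3$ image and four $5 \times 5$ filters. The random image and weights are placeholders, and the sketch assumes no padding and a stride of one:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28, 3))      # (height, width, channels)
kernels = rng.normal(size=(5, 5, 3, 4))   # (F, F, input channels, filters)
biases = np.zeros(4)

F = kernels.shape[0]
out_h = image.shape[0] - F + 1
out_w = image.shape[1] - F + 1
feature_maps = np.zeros((out_h, out_w, kernels.shape[3]))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + F, j:j + F, :]   # shape (5, 5, 3)
        # Multiply by each filter and sum over height, width and channels.
        z = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2])) + biases
        feature_maps[i, j, :] = np.maximum(z, 0.0)   # relu

print(feature_maps.shape)   # (24, 24, 4)
```

Each filter spans all three input channels, so the output has one channel per filter rather than one per input channel.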
- Strides: increment step size for the convolution operator
- Reduces the size of the output map
Example with kernel size
- Padding: artificially fill borders of image
- Useful to keep spatial dimension constant across filters
- Useful with strides and large receptive fields
- Usually: fill with 0s
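The effect of strides and padding on the output size can be checked with a short helper. The function name `conv_output_size` is made up for this sketch, which uses the usual formula $\lfloor (W + 2P - F) / S \rfloor + 1$ along one spatial dimension:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))              # 3: no padding, stride 1
print(conv_output_size(5, 3, stride=2))    # 2: the stride shrinks the output map
print(conv_output_size(5, 3, padding=1))   # 5: zero padding keeps the size constant
print(conv_output_size(28, 5))             # 24
```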
Kernel or Filter shape $(F, F, C^i, C^o)$:
- $F \times F$ kernel size
- $C^i$ input channels
- $C^o$ output channels
Number of parameters: $(F \times F \times C^i + 1) \times C^o$
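As a quick check, a convolution layer has $F \times F \times C^i \times C^o$ weights plus one bias per output channel. The helper name `conv_params` below is made up for this sketch:

```python
def conv_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Number of trainable parameters of a 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels if use_bias else 0
    return weights + biases

print(conv_params(5, 3, 4))     # 304: four 5 x 5 filters on an RGB image
print(conv_params(3, 64, 64))   # 36928: 64 filters of size 3 x 3 on 64 channels
```

The count does not depend on the image size, which is the parameter-sharing advantage over a dense layer.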
Activations or Feature maps shape:
- Input $(W^i, H^i, C^i)$
- Output $(W^o, H^o, C^o)$
The filters hold the trainable parameters of the model (excluding the biases).
The feature maps are the outputs (or activations) of convolution layers when applied to a specific batch of images.
- Spatial dimension reduction
- Local invariance
- No parameters: typically maximum or average of 2x2 units
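A minimal NumPy sketch of $2 \times 2$ max pooling on a single feature map, assuming a stride of 2 and dropping any odd last row or column:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Maximum over non-overlapping 2 x 2 blocks of a 2D feature map."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2   # drop an odd last row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]], dtype=float)
pooled = max_pool_2x2(fm)

print(pooled)
# [[ 4.  8.]
#  [12. 16.]]
```

The output is half the size in each spatial dimension, and no parameters are learned.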
- Dropout - a good way to regularise your network
- Batch Norm - normalise the data at each layer of the network
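Both ideas can be sketched in a few lines of NumPy. This is a simplified illustration: the dropout below is the common "inverted" variant, and the batch normalisation omits the learnable scale and shift and the running statistics that library layers keep for inference.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the units during training."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    # Rescale the kept units so the expected activation is unchanged.
    return x * keep / (1.0 - rate)

def batch_norm(x, eps=1e-5):
    """Normalise each feature to zero mean and unit variance over the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # batch of 256 samples, 4 features
normed = batch_norm(x)

ones = np.ones((100, 10))
dropped = dropout(ones, rate=0.5, rng=rng)                  # training: entries are 0 or 2
same = dropout(ones, rate=0.5, rng=rng, training=False)     # inference: unchanged
```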
- Input
- Conv blocks (repeat these a few times)
  - Convolution + activation (relu)
  - Convolution + activation (relu)
  - ...
  - Maxpooling 2x2
- Output
  - Fully connected layers
  - Softmax
# assuming the Keras bundled with TensorFlow (Keras 2 style API)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model = Sequential()
# Conv2D(filters, kernel_size); the input is channels-last: (height, width, channels)
model.add(Conv2D(64, (3, 3), activation='relu', input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))
# flatten the feature maps before the fully connected layers
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(1000, activation='softmax'))
Even deeper models:
34, 50, 101, 152 layers
A block learns the residual with respect to identity.
- Good optimization properties
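A simplified sketch of the residual idea, $y = x + f(x)$, using two small dense layers for $f$. This is an assumption made for brevity: the blocks in actual residual networks use convolutions and batch normalisation, and the weights here are placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Return x + f(x), where f is a small two-layer transformation."""
    residual = relu(x @ W1) @ W2   # f(x): the part the block has to learn
    return x + residual            # skip connection adds the identity back

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))       # batch of 8 vectors with 16 features

# With all weights at zero the block is exactly the identity.
W_zero = np.zeros((16, 16))
identity_out = residual_block(x, W_zero, W_zero)

# With small random weights the output is a small correction to x.
W1 = 0.01 * rng.normal(size=(16, 16))
W2 = 0.01 * rng.normal(size=(16, 16))
out = residual_block(x, W1, W2)
```

Because the block only has to learn a correction to its input, stacking many of them does not make it harder to represent the identity, which is one common explanation for the good optimization properties.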
Require millions of images and days or weeks of GPU time to train. Don't usually have either. What to do?
- Treat a whole network as a "feature transformer"
- Use the last or second to last layer as input features to a logistic regression or a small neural network which is trained on our small dataset
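The two steps above can be sketched without any deep learning library. In this illustration the "pretrained network" is replaced by a stand-in: a fixed random projection followed by relu, named `W_frozen` here. The data is synthetic. Only the logistic regression on top of the extracted features is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network: a FIXED random projection followed by relu.
W_frozen = rng.normal(size=(20, 64))

def extract_features(X):
    h = np.maximum(X @ W_frozen, 0.0)
    # Scale each row to unit length so plain gradient descent is stable.
    return h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-12)

# Small made-up labelled dataset: two classes with shifted means.
X = np.vstack([rng.normal(-0.5, 1.0, size=(40, 20)),
               rng.normal(+0.5, 1.0, size=(40, 20))])
y = np.concatenate([np.zeros(40), np.ones(40)])

features = extract_features(X)   # computed once; W_frozen is never updated

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y):
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Logistic regression trained on the extracted features only.
w = np.zeros(features.shape[1])
b = 0.0
alpha = 1.0  # learning rate (assumed)
initial_loss = log_loss(sigmoid(features @ w + b), y)
for _ in range(300):
    p = sigmoid(features @ w + b)
    w -= alpha * features.T @ (p - y) / len(y)
    b -= alpha * np.mean(p - y)
final_loss = log_loss(sigmoid(features @ w + b), y)
print(initial_loss, final_loss)
```

With a real pretrained network, `extract_features` would instead run the images through the network and return the activations of the last or second to last layer.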
- teachable machine demo
How good are you compared to a computer at quickly identifying cats vs dogs?
What is the left picture? What is the right picture?
- Convnet on fashion MNIST
- transfer learning on road bike dataset
- what do you want to do as project?