Batch struct #84

Merged
merged 40 commits into main from transformer_description on Oct 25, 2023
efd4898
Added a new struct Batch and outsourced some of the DataLoader functi…
benedict-96 Oct 12, 2023
b617864
Started giving description of general Attention layer + short history.
benedict-96 Oct 12, 2023
885f420
Added documentation.
benedict-96 Oct 12, 2023
c72e6b7
Added documentation and made the optimizer script somewhat more reada…
benedict-96 Oct 12, 2023
6d787da
Stopped exporting functions with very generic names (e.g. loss). Now …
benedict-96 Oct 12, 2023
942d999
Short test for the new Batch struct.
benedict-96 Oct 12, 2023
d19b81f
Changed the script so that it now displays the loss per epoch.
benedict-96 Oct 13, 2023
46ca983
Changed @views to @view.
benedict-96 Oct 13, 2023
0faef56
Changed script such that CUDABackend() is default, and if not availab…
benedict-96 Oct 13, 2023
c0bc454
now displaying times after simulation has finished.
benedict-96 Oct 13, 2023
7083758
Added description on optimize_for_one_epoch! output
benedict-96 Oct 13, 2023
71de340
Fixed typos (came from changing some variable names in the struct).
benedict-96 Oct 13, 2023
d54b35c
Changed @view to @views.
benedict-96 Oct 15, 2023
28178bb
Added copy to hopefully resolve the Zygote/GPU issue.
benedict-96 Oct 15, 2023
b6d6cbf
Changed some of the parameters.
benedict-96 Oct 15, 2023
c42b39a
Outputting times to a file now.
benedict-96 Oct 15, 2023
dce7471
Updated todo list. (everything done).
benedict-96 Oct 15, 2023
6a58edb
Now also printing the accuracy of each optimizer.
benedict-96 Oct 16, 2023
a57852f
Changed layer number to 16.
benedict-96 Oct 16, 2023
e77191f
Increased line width of plots and am now also outputting accuracy.
benedict-96 Oct 16, 2023
cd34a92
Added another comment regarding additive attention and started descri…
benedict-96 Oct 20, 2023
5984cca
Copy-pasted a section from the optimizer paper.
benedict-96 Oct 20, 2023
e266bb1
Now not exporting loss anymore, has to be taken care of in the test!
benedict-96 Oct 20, 2023
9836717
Now not exporting loss anymore, has to be taken care of in the test!
benedict-96 Oct 20, 2023
0f77270
Not exporting loss anymore; has to be taken care of in the tests.
benedict-96 Oct 20, 2023
009a17c
Fixed typos; am now explicitly importing loss in these files.
benedict-96 Oct 20, 2023
9e4d9d1
Fixed typo. Was calling the same routine recursively without doing an…
benedict-96 Oct 20, 2023
a9cf905
Also saving accuracy scores now.
benedict-96 Oct 23, 2023
04f83ac
Changed number of epochs and outputting that number now.
benedict-96 Oct 23, 2023
3705520
Added script that should analyse the result of a very small transform…
benedict-96 Oct 23, 2023
342e2d6
Changed number of epochs.
benedict-96 Oct 23, 2023
85e7df3
Changed number of heads back to what it was before.
benedict-96 Oct 23, 2023
c40f903
There was an error with Julia 1.8: syntax: function argument and stat…
benedict-96 Oct 25, 2023
1e3c790
Merge branch 'main' into transformer_description
benedict-96 Oct 25, 2023
60856b6
new attempt to fix J1.8 issue.
benedict-96 Oct 25, 2023
c773dec
Merge branch 'transformer_description' of https://github.com/JuliaGNI…
benedict-96 Oct 25, 2023
2a77845
New try. Made outer constructor an inner constructor.
benedict-96 Oct 25, 2023
4070f22
Now all constructors are inner constructors.
benedict-96 Oct 25, 2023
7c0dba1
Replaced two constructors with one. Added hasseqlength for Batch.
benedict-96 Oct 25, 2023
473c8ae
Changed (true, false) to (<:Integer, <:Nothing).
benedict-96 Oct 25, 2023
8 changes: 4 additions & 4 deletions docs/src/data_loader/TODO.md
@@ -1,9 +1,9 @@
# DATA Loader TODO

1. Implement `@views` instead of allocating a new array in every step.
2. Implement **sampling without replacement**.
3. Store information on the epoch and the current loss.
4. Usually the training loss is computed over the entire data set, we are probably going to do this for one epoch via
- [x] Implement `@views` instead of allocating a new array in every step.
- [x] Implement **sampling without replacement**.
- [x] Store information on the epoch and the current loss.
- [x] Usually the training loss is computed over the entire data set; we are probably going to do this for one epoch via
```math
loss_e = \frac{1}{|batches|}\sum_{batch\in{}batches}loss(batch).
```
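
A minimal sketch of this epoch loss, with a placeholder loss function and placeholder batches rather than the package API, could look like:

```julia
# placeholder loss and batches, purely for illustration
loss(batch) = sum(abs2, batch) / length(batch)
batches = [randn(4, 3) for _ in 1:5]

# loss_e = (1/|batches|) * Σ_{batch ∈ batches} loss(batch)
loss_e = sum(loss(batch) for batch in batches) / length(batches)
```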
35 changes: 33 additions & 2 deletions docs/src/layers/attention_layer.md
@@ -1,6 +1,27 @@
# The Attention Layer

The attention layer (and the *orthonormal activation* function defined for it) was specifically designed to generalize transformers to symplectic data.
The *attention* mechanism was originally applied to image and natural language processing (NLP) tasks. In (Bahdanau et al, 2014) "additive" attention is used:

```math
(z_q, z_k) \mapsto v^T\sigma(Wz_q + Uz_k).
```

However, "multiplicative" attention is more straightforward to interpret and computationally cheaper to handle:

```math
(z_q, z_k) \mapsto z_q^TWz_k.
```

Regardless of the type of attention used, all of these mechanisms compute correlations between input sequences, on the basis of which further neural-network computations are performed. So given two input sequences $(z_q^{(1)}, \ldots, z_q^{(T)})$ and $(z_k^{(1)}, \ldots, z_k^{(T)})$, an attention mechanism returns an output $C\in\mathbb{R}^{T\times{}T}$ with entries $[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})$.
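
As a small illustration (with assumed shapes and variable names, not code from the package), the multiplicative scores for two sequences stored as the columns of matrices can be computed with a single matrix product:

```julia
# sketch with hypothetical names: multiplicative attention scores for two
# input sequences, each stored as the T columns of an N×T matrix
N, T = 4, 6
Zq, Zk = randn(N, T), randn(N, T)   # query and key sequences
W = randn(N, N)                     # learnable weight matrix

# [C]_ij = (zq⁽ⁱ⁾)ᵀ W zk⁽ʲ⁾
C = Zq' * W * Zk                    # T×T correlation matrix
```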

## Self Attention

## Attention in `GeometricMachineLearning`

The attention layer (and the *orthonormal activation* function defined for it) in `GeometricMachineLearning` was specifically designed to generalize transformers to symplectic data.
Usually a self-attention layer takes the following form:

```math
@@ -54,4 +75,14 @@ Multiplying with the matrix $\Lambda(Z)$ from the right onto $[z^1, \ldots, z^T]
\right]
```

from the left onto the big vector.


## Historical Note

Attention was used before the transformer, but always in connection with **recurrent neural networks** (see (Luong et al, 2015) and (Bahdanau et al, 2014)).


## References
- Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
27 changes: 27 additions & 0 deletions docs/src/layers/multihead_attention_layer.md
@@ -23,5 +23,32 @@ The transformer contains a **self-attention mechanism**, i.e. takes an input $X$

Here the various $P$ matrices can be interpreted as being projections onto lower-dimensional subspaces, hence the designation by the letter $P$. Because of this interpretation as projection matrices onto smaller spaces that should **capture features in the input data** it makes sense to constrain these elements to be part of the Stiefel manifold.

## Computing Correlations in the Multihead-Attention Layer

The [attention mechanism](attention_layer.md) describes a reweighting of the "values" $V_i$ based on correlations between the "keys" $K_i$ and the "queries" $Q_i$. First note the structure of these matrices: each is a collection of $T$ $(N\div\mathtt{n\_heads})$-dimensional vectors, i.e. $V_i=[v_i^{(1)}, \ldots, v_i^{(T)}], K_i=[k_i^{(1)}, \ldots, k_i^{(T)}], Q_i=[q_i^{(1)}, \ldots, q_i^{(T)}]$. These vectors are obtained by applying the respective projection matrices to the original input $I_i\in\mathbb{R}^{N\times{}T}$.

When performing the *reweighting* of the columns of $V_i$ we first compute the correlations between the vectors in $K_i$ and in $Q_i$ and store the results in a *correlation matrix* $C_i$:

```math
[C_i]_{mn} = \left(k_i^{(m)}\right)^Tq_i^{(n)}.
```

The columns of this correlation matrix are then rescaled with a softmax function, yielding a matrix of *probability vectors* $\mathcal{P}_i$:

```math
[\mathcal{P}_i]_{\bullet{}n} = \mathrm{softmax}([C_i]_{\bullet{}n}).
```

Finally, the matrix $\mathcal{P}_i$ is multiplied onto $V_i$ from the right, resulting in $T$ convex combinations of the $T$ vectors $v_i^{(m)}$ with $m=1,\ldots,T$:

```math
V_i\mathcal{P}_i = \left[\sum_{m=1}^{T}[\mathcal{P}_i]_{m,1}v_i^{(m)}, \ldots, \sum_{m=1}^{T}[\mathcal{P}_i]_{m,T}v_i^{(m)}\right].
```

With this we can now give a better interpretation of what the projection matrices $W_i^V$, $W_i^K$ and $W_i^Q$ should do: they map the original data to lower-dimensional subspaces. We then compute correlations between the representation in the $K$ and in the $Q$ basis and use this correlation to perform a convex reweighting of the vectors in the $V$ basis. These reweighted *values* are then fed into a standard feedforward neural network.

Because the main task of the $W_i^V$, $W_i^K$ and $W_i^Q$ matrices is to find suitable bases, it makes sense to constrain them to the Stiefel manifold; they do not and should not have the maximum possible generality.
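
The following is a minimal sketch of one such head; the dimensions, variable names and the column-wise softmax are assumptions made for illustration, not the implementation in `GeometricMachineLearning`:

```julia
# sketch with assumed names and dimensions: one head of a multihead attention
# layer acting on an input X ∈ ℝ^(N×T)
N, T, n_heads = 49, 16, 7
d = N ÷ n_heads                         # dimension of this head's subspace

X = randn(N, T)                         # input: T feature vectors of dimension N
Wq, Wk, Wv = randn(d, N), randn(d, N), randn(d, N)  # projection matrices of this head
Q, K, V = Wq * X, Wk * X, Wv * X        # queries, keys and values (d×T each)

C = K' * Q                              # [C]_mn = (k⁽ᵐ⁾)ᵀ q⁽ⁿ⁾, size T×T
softmax(x) = exp.(x) ./ sum(exp.(x))    # rescales one column to a probability vector
P = hcat([softmax(C[:, n]) for n in 1:T]...)   # column-wise softmax of C
output = V * P                          # T convex combinations of the value vectors
```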


## References
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
91 changes: 91 additions & 0 deletions scripts/transformer_analysis.jl
@@ -0,0 +1,91 @@
"""
TODO: Add a better predictor at the end! It should set the biggest value of the softmax to 1 and the rest to zero!
"""

using GeometricMachineLearning, LinearAlgebra, ProgressMeter, Plots, CUDA
using AbstractNeuralNetworks
import Zygote, MLDatasets

# remove this after AbstractNeuralNetworks PR has been merged
GeometricMachineLearning.Chain(model::Chain, d::AbstractNeuralNetworks.AbstractExplicitLayer) = Chain(model.layers..., d)
GeometricMachineLearning.Chain(d::AbstractNeuralNetworks.AbstractExplicitLayer, model::Chain) = Chain(d, model.layers...)

# MNIST images are 28×28, so a sequence_length of 16 = 4² means the image patches are of size 7² = 49
image_dim = 28
patch_length = 7
transformer_dim = 49
n_heads = 7
n_layers = 1
number_of_patch = (image_dim÷patch_length)^2
batch_size = 2048
activation = softmax
n_epochs = 500
add_connection = false

train_x, train_y = MLDatasets.MNIST(split=:train)[:]
test_x, test_y = MLDatasets.MNIST(split=:test)[:]

# use the CUDA backend if available, otherwise use CPU()
backend, train_x, test_x, train_y, test_y =
try
CUDABackend(),
train_x |> cu,
test_x |> cu,
train_y |> cu,
test_y |> cu
catch
CPU(),
train_x,
test_x,
train_y,
test_y
end


#encoder layer - final layer has to be added for evaluation purposes!
model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection),
Classification(patch_length^2, 10, activation))

model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_connection=add_connection),
Classification(patch_length^2, 10, activation))

# trains the network for n_epochs epochs with the given optimizer and returns the per-epoch losses, the parameters, the total runtime and the test accuracy
function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer())
# call data loader
dl = DataLoader(train_x, train_y)
dl_test = DataLoader(test_x, test_y)
batch = Batch(batch_size)

ps = initialparameters(backend, eltype(dl.input), Ψᵉ)

optimizer_instance = Optimizer(opt, ps)

println("initial test accuracy: ", GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test), "\n")

progress_object = Progress(n_epochs; enabled=true)

# use the `time` function to get the system time.
init_time = time()
total_time = time() - init_time

loss_array = zeros(eltype(train_x), n_epochs)
for i in 1:n_epochs
loss_val = optimize_for_one_epoch!(optimizer_instance, Ψᵉ, ps, dl, batch)

ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])
loss_array[i] = loss_val

# update runtime
total_time = time() - init_time
end

accuracy_score = GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test)
println("final test accuracy: ", accuracy_score, "\n")

loss_array, ps, total_time, accuracy_score
end

loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs)
loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs)

#display(ps1.layer_1.PQ)
99 changes: 62 additions & 37 deletions scripts/transformer_new.jl
@@ -15,75 +15,100 @@ image_dim = 28
patch_length = 7
transformer_dim = 49
n_heads = 7
n_layers = 10
n_layers = 16
number_of_patch = (image_dim÷patch_length)^2
batch_size = 2048
activation = softmax
n_epochs = 1000
n_epochs = 500
add_connection = false
backend = CUDABackend()

train_x, train_y = MLDatasets.MNIST(split=:train)[:]
test_x, test_y = MLDatasets.MNIST(split=:test)[:]
if backend == CUDABackend()
train_x = train_x |> cu
test_x = test_x |> cu
train_y = train_y |> cu
test_y = test_y |> cu

# use the CUDA backend if available, otherwise use CPU()
backend, train_x, test_x, train_y, test_y =
try
CUDABackend(),
train_x |> cu,
test_x |> cu,
train_y |> cu,
test_y |> cu
catch
CPU(),
train_x,
test_x,
train_y,
test_y
end


#encoder layer - final layer has to be added for evaluation purposes!
model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection),
Classification(patch_length^2, 10, activation))

model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_connection=add_connection),
Classification(patch_length^2, 10, activation))


# trains the network for n_epochs epochs with the given optimizer and returns the per-epoch losses, the parameters, the total runtime and the test accuracy
function transformer_training(Ψᵉ::Chain; backend=CPU(), n_training_steps=10000, o=AdamOptimizer())
function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer())
# call data loader
dl = DataLoader(train_x, train_y, batch_size=batch_size)
dl_test = DataLoader(test_x, test_y, batch_size=length(test_y))
dl = DataLoader(train_x, train_y)
dl_test = DataLoader(test_x, test_y)
batch = Batch(batch_size)

ps = initialparameters(backend, eltype(dl.input), Ψᵉ)

ps = initialparameters(backend, eltype(dl.data), Ψᵉ)
optimizer_instance = Optimizer(opt, ps)

optimizer_instance = Optimizer(o, ps)
println("initial test accuracy: ", GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test), "\n")

println("initial test loss: ", loss(Ψᵉ, ps, dl_test), "\n")
progress_object = Progress(n_epochs; enabled=true)

progress_object = Progress(n_training_steps; enabled=true)
# use the `time` function to get the system time.
init_time = time()
total_time = time() - init_time

loss_array = zeros(eltype(train_x), n_training_steps)
for i in 1:n_training_steps
redraw_batch!(dl)
# ask Michael to take a look at this. Probably not good for performance.
loss_val, pb = Zygote.pullback(ps -> loss(Ψᵉ, ps, dl), ps)
dp = pb(one(loss_val))[1]
loss_array = zeros(eltype(train_x), n_epochs)
for i in 1:n_epochs
loss_val = optimize_for_one_epoch!(optimizer_instance, Ψᵉ, ps, dl, batch)

optimization_step!(optimizer_instance, Ψᵉ, ps, dp)
ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])
loss_array[i] = loss_val

# update runtime
total_time = time() - init_time
end

println("final test loss: ", loss(Ψᵉ, ps, dl_test), "\n")
accuracy_score = GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test)
println("final test accuracy: ", accuracy_score, "\n")

loss_array, ps
loss_array, ps, total_time, accuracy_score
end

# calculate number of epochs
n_training_steps = Int(ceil(length(train_y)*n_epochs/batch_size))

loss_array2, ps2 = transformer_training(model2, backend=backend, n_training_steps=n_training_steps)
loss_array1, ps1 = transformer_training(model1, backend=backend, n_training_steps=n_training_steps)
loss_array3, ps3 = transformer_training(model2, backend=backend, n_training_steps=n_training_steps, o=GradientOptimizer(0.001))
loss_array4, ps4 = transformer_training(model2, backend=backend, n_training_steps=n_training_steps, o=MomentumOptimizer(0.001, 0.5))
loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs)
loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs)
loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(0.001))
loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5))

p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4))
plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold")
p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2)
plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2)
png(p1, "Stiefel_Regular")

p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4))
plot!(p2, loss_array3, color=1, label="Gradient")
plot!(p2, loss_array4, color=3, label="Momentum")
png(p2, "Adam_Gradient_Momentum")
p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2)
plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2)
plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2)
png(p2, "Adam_Gradient_Momentum")

text_string =
"n_epochs: " * string(n_epochs) * "\n" *
"Regular weights: time: " * string(total_time1) * " classification accuracy: " * string(accuracy_score1) * "\n" *
"Stiefel weights: time: " * string(total_time2) * " classification accuracy: " * string(accuracy_score2) * "\n" *
"GradientOptimizer: time: " * string(total_time3) * " classification accuracy: " * string(accuracy_score3) * "\n" *
"MomentumOptimizer: time: " * string(total_time4) * " classification accuracy: " * string(accuracy_score4) * "\n"

display(text_string)

open("measure_times"*string(backend), "w") do file
write(file, text_string)
end
21 changes: 12 additions & 9 deletions src/GeometricMachineLearning.jl
@@ -67,12 +67,6 @@ module GeometricMachineLearning
include("kernels/kernel_ad_routines/vec_tensor_mul.jl")
# export tensor_mat_mul

include("data_loader/tensor_assign.jl")
include("data_loader/matrix_assign.jl")
include("data_loader/data_loader.jl")
include("data_loader/mnist_utils.jl")
export DataLoader, redraw_batch!, loss, onehotbatch, accuracy

# this defines empty retraction type structs (doesn't rely on anything)
include("optimizers/manifold_related/retraction_types.jl")

@@ -174,7 +168,7 @@
#INCLUDE ABSTRACT TRAINING integrator
export AbstractTrainingMethod

export loss_single, loss
export loss_single #, loss

export HnnTrainingMethod
export LnnTrainingMethod
@@ -264,14 +258,14 @@
# INCLUDE NEURALNET SOLUTION

export SingleHistory
export parameters, datashape, loss
export parameters, datashape
export History
export last, sizemax, nbtraining, show

include("nnsolution/history.jl")

export NeuralNetSolution
export nn, problem, tstep, loss, history, size_history
export nn, problem, tstep, history, size_history
export set_sizemax_history

include("nnsolution/neural_net_solution.jl")
@@ -356,6 +350,15 @@
export integrate, integrate_step!

include("integrator/sympnet_integrator.jl")

export DataLoader, onehotbatch, accuracy
export Batch, optimize_for_one_epoch!
include("data_loader/tensor_assign.jl")
include("data_loader/matrix_assign.jl")
include("data_loader/data_loader.jl")
include("data_loader/mnist_utils.jl")
include("data_loader/batch.jl")


include("reduced_system/system_type.jl")
include("reduced_system/reduced_system.jl")