Batch struct #84

Merged
merged 40 commits into main from transformer_description on Oct 25, 2023
efd4898
Added a new struct Batch and outsourced some of the DataLoader functi…
benedict-96 Oct 12, 2023
b617864
Started giving description of general Attention layer + short history.
benedict-96 Oct 12, 2023
885f420
Added documentation.
benedict-96 Oct 12, 2023
c72e6b7
Added documentation and made the optimizer script somewhat more reada…
benedict-96 Oct 12, 2023
6d787da
Stopped exporting functions with very generic names (e.g. loss). Now …
benedict-96 Oct 12, 2023
942d999
Short test for the new Batch struct.
benedict-96 Oct 12, 2023
d19b81f
Changed the script so that it now displays the loss per epoch.
benedict-96 Oct 13, 2023
46ca983
Changed @views to @view.
benedict-96 Oct 13, 2023
0faef56
Changed script such that CUDABackend() is default, and if not availab…
benedict-96 Oct 13, 2023
c0bc454
now displaying times after simulation has finished.
benedict-96 Oct 13, 2023
7083758
Added description on optimize_for_one_epoch! output
benedict-96 Oct 13, 2023
71de340
Fixed typos (came from changing some variable names in the struct).
benedict-96 Oct 13, 2023
d54b35c
Changed @view to @views.
benedict-96 Oct 15, 2023
28178bb
Added copy to hopefully resolve the Zygote/GPU issue.
benedict-96 Oct 15, 2023
b6d6cbf
Changed some of the parameters.
benedict-96 Oct 15, 2023
c42b39a
Outputting times to a file now.
benedict-96 Oct 15, 2023
dce7471
Updated todo list. (everything done).
benedict-96 Oct 15, 2023
6a58edb
Now also printing the accuracy of each optimizer.
benedict-96 Oct 16, 2023
a57852f
Changed layer number to 16.
benedict-96 Oct 16, 2023
e77191f
Increased line width of plots and am now also outputting accuracy.
benedict-96 Oct 16, 2023
cd34a92
Added another comment regarding additive attention and started descri…
benedict-96 Oct 20, 2023
5984cca
Copy-pasted a section from the optimizer paper.
benedict-96 Oct 20, 2023
e266bb1
Now not exporting loss anymore, has to be taken care of in the test!
benedict-96 Oct 20, 2023
9836717
Now not exporting loss anymore, has to be taken care of in the test!
benedict-96 Oct 20, 2023
0f77270
Not exporting loss anymore; has to be taken care of in the tests.
benedict-96 Oct 20, 2023
009a17c
Fixed typos; am now explicitly importing loss in these files.
benedict-96 Oct 20, 2023
9e4d9d1
Fixed typo. Was calling the same routine recursively without doing an…
benedict-96 Oct 20, 2023
a9cf905
Also saving accuracy scores now.
benedict-96 Oct 23, 2023
04f83ac
Changed number of epochs and outputting that number now.
benedict-96 Oct 23, 2023
3705520
Added script that should analyse the result of a very small transform…
benedict-96 Oct 23, 2023
342e2d6
Changed number of epochs.
benedict-96 Oct 23, 2023
85e7df3
Changed number of heads back to what it was before.
benedict-96 Oct 23, 2023
c40f903
There was an error with Julia 1.8: syntax: function argument and stat…
benedict-96 Oct 25, 2023
1e3c790
Merge branch 'main' into transformer_description
benedict-96 Oct 25, 2023
60856b6
new attempt to fix J1.8 issue.
benedict-96 Oct 25, 2023
c773dec
Merge branch 'transformer_description' of https://github.com/JuliaGNI…
benedict-96 Oct 25, 2023
2a77845
New try. Made outer constructor an inner constructor.
benedict-96 Oct 25, 2023
4070f22
Now all constructors are inner constructors.
benedict-96 Oct 25, 2023
7c0dba1
Replaced two constructors with one. Added hasseqlength for Batch.
benedict-96 Oct 25, 2023
473c8ae
Changed (true, false) to (<:Integer, <:Nothing).
benedict-96 Oct 25, 2023
8 changes: 4 additions & 4 deletions docs/src/data_loader/TODO.md
@@ -1,9 +1,9 @@
# DATA Loader TODO

1. Implement `@views` instead of allocating a new array in every step.
2. Implement **sampling without replacement**.
3. Store information on the epoch and the current loss.
4. Usually the training loss is computed over the entire data set, we are probably going to do this for one epoch via
- [x] Implement `@views` instead of allocating a new array in every step.
- [x] Implement **sampling without replacement**.
- [x] Store information on the epoch and the current loss.
- [x] Usually the training loss is computed over the entire data set; we are probably going to do this for one epoch via
```math
loss_e = \frac{1}{|batches|}\sum_{batch\in{}batches}loss(batch).
```
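
A minimal sketch of this epoch loss, with a placeholder loss function and placeholder batches rather than the package API, could look like:

```julia
# placeholder loss and batches, purely for illustration
loss(batch) = sum(abs2, batch) / length(batch)
batches = [randn(4, 3) for _ in 1:5]

# loss_e = (1/|batches|) * Σ_{batch ∈ batches} loss(batch)
loss_e = sum(loss(batch) for batch in batches) / length(batches)
```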
35 changes: 33 additions & 2 deletions docs/src/layers/attention_layer.md
@@ -1,6 +1,27 @@
# The Attention Layer

The attention layer (and the *orthonormal activation* function defined for it) was specifically designed to generalize transformers to symplectic data.
The *attention* mechanism was originally applied to image and natural language processing (NLP) tasks. In (Bahdanau et al, 2014) "additive" attention is used:

```math
(z_q, z_k) \mapsto v^T\sigma(Wz_q + Uz_k).
```

However, "multiplicative" attention is more straightforward to interpret and computationally cheaper to handle:

```math
(z_q, z_k) \mapsto z_q^TWz_k.
```

Regardless of the type of attention used, all of these mechanisms compute correlations between input sequences, on the basis of which further neural-network computations are performed. So given two input sequences $(z_q^{(1)}, \ldots, z_q^{(T)})$ and $(z_k^{(1)}, \ldots, z_k^{(T)})$, an attention mechanism returns an output $C\in\mathbb{R}^{T\times{}T}$ with entries $[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})$.
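
As a small illustration (with assumed shapes and variable names, not code from the package), the multiplicative scores for two sequences stored as the columns of matrices can be computed with a single matrix product:

```julia
# sketch with hypothetical names: multiplicative attention scores for two
# input sequences, each stored as the T columns of an N×T matrix
N, T = 4, 6
Zq, Zk = randn(N, T), randn(N, T)   # query and key sequences
W = randn(N, N)                     # learnable weight matrix

# [C]_ij = (zq⁽ⁱ⁾)ᵀ W zk⁽ʲ⁾
C = Zq' * W * Zk                    # T×T correlation matrix
```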

## Self Attention

## Attention in `GeometricMachineLearning`

The attention layer (and the *orthonormal activation* function defined for it) in `GeometricMachineLearning` was specifically designed to generalize transformers to symplectic data.
Usually a self-attention layer takes the following form:

```math
@@ -54,4 +75,14 @@ Multiplying with the matrix $\Lambda(Z)$ from the right onto $[z^1, \ldots, z^T]
\right]
```

from the left onto the big vector.


## Historical Note

Attention was used before the transformer, but always in connection with **recurrent neural networks** (see (Luong et al, 2015) and (Bahdanau et al, 2014)).


## References
- Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
27 changes: 27 additions & 0 deletions docs/src/layers/multihead_attention_layer.md
@@ -23,5 +23,32 @@ The transformer contains a **self-attention mechanism**, i.e. takes an input $X$

Here the various $P$ matrices can be interpreted as being projections onto lower-dimensional subspaces, hence the designation by the letter $P$. Because of this interpretation as projection matrices onto smaller spaces that should **capture features in the input data** it makes sense to constrain these elements to be part of the Stiefel manifold.

## Computing Correlations in the Multihead-Attention Layer

The [attention mechanism](attention_layer.md) describes a reweighting of the "values" $V_i$ based on correlations between the "keys" $K_i$ and the "queries" $Q_i$. First note the structure of these matrices: each is a collection of $T$ $(N\div\mathtt{n\_heads})$-dimensional vectors, i.e. $V_i=[v_i^{(1)}, \ldots, v_i^{(T)}], K_i=[k_i^{(1)}, \ldots, k_i^{(T)}], Q_i=[q_i^{(1)}, \ldots, q_i^{(T)}]$. These vectors are obtained by applying the respective projection matrices to the original input $I_i\in\mathbb{R}^{N\times{}T}$.

When performing the *reweighting* of the columns of $V_i$ we first compute the correlations between the vectors in $K_i$ and in $Q_i$ and store the results in a *correlation matrix* $C_i$:

```math
[C_i]_{mn} = \left(k_i^{(m)}\right)^Tq_i^{(n)}.
```

The columns of this correlation matrix are then rescaled with a softmax function, yielding a matrix of *probability vectors* $\mathcal{P}_i$:

```math
[\mathcal{P}_i]_{\bullet{}n} = \mathrm{softmax}([C_i]_{\bullet{}n}).
```

Finally, the matrix $\mathcal{P}_i$ is multiplied onto $V_i$ from the right, resulting in $T$ convex combinations of the $T$ vectors $v_i^{(m)}$ with $m=1,\ldots,T$:

```math
V_i\mathcal{P}_i = \left[\sum_{m=1}^{T}[\mathcal{P}_i]_{m,1}v_i^{(m)}, \ldots, \sum_{m=1}^{T}[\mathcal{P}_i]_{m,T}v_i^{(m)}\right].
```

With this we can now give a better interpretation of what the projection matrices $W_i^V$, $W_i^K$ and $W_i^Q$ should do: they map the original data to lower-dimensional subspaces. We then compute correlations between the representation in the $K$ and in the $Q$ basis and use this correlation to perform a convex reweighting of the vectors in the $V$ basis. These reweighted *values* are then fed into a standard feedforward neural network.

Because the main task of the $W_i^V$, $W_i^K$ and $W_i^Q$ matrices is to find suitable bases, it makes sense to constrain them to the Stiefel manifold; they do not and should not have the maximum possible generality.
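
The following is a minimal sketch of one such head; the dimensions, variable names and the column-wise softmax are assumptions made for illustration, not the implementation in `GeometricMachineLearning`:

```julia
# sketch with assumed names and dimensions: one head of a multihead attention
# layer acting on an input X ∈ ℝ^(N×T)
N, T, n_heads = 49, 16, 7
d = N ÷ n_heads                         # dimension of this head's subspace

X = randn(N, T)                         # input: T feature vectors of dimension N
Wq, Wk, Wv = randn(d, N), randn(d, N), randn(d, N)  # projection matrices of this head
Q, K, V = Wq * X, Wk * X, Wv * X        # queries, keys and values (d×T each)

C = K' * Q                              # [C]_mn = (k⁽ᵐ⁾)ᵀ q⁽ⁿ⁾, size T×T
softmax(x) = exp.(x) ./ sum(exp.(x))    # rescales one column to a probability vector
P = hcat([softmax(C[:, n]) for n in 1:T]...)   # column-wise softmax of C
output = V * P                          # T convex combinations of the value vectors
```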


## References
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
91 changes: 91 additions & 0 deletions scripts/transformer_analysis.jl
@@ -0,0 +1,91 @@
"""
TODO: Add a better predictor at the end! It should set the biggest value of the softmax to 1 and the rest to zero!
"""

using GeometricMachineLearning, LinearAlgebra, ProgressMeter, Plots, CUDA
using AbstractNeuralNetworks
import Zygote, MLDatasets

# remove this after AbstractNeuralNetworks PR has been merged
GeometricMachineLearning.Chain(model::Chain, d::AbstractNeuralNetworks.AbstractExplicitLayer) = Chain(model.layers..., d)
GeometricMachineLearning.Chain(d::AbstractNeuralNetworks.AbstractExplicitLayer, model::Chain) = Chain(d, model.layers...)

# MNIST images are 28×28, so a sequence_length of 16 = 4² means the image patches are of size 7² = 49
image_dim = 28
patch_length = 7
transformer_dim = 49
n_heads = 7
n_layers = 1
number_of_patch = (image_dim÷patch_length)^2
batch_size = 2048
activation = softmax
n_epochs = 500
add_connection = false

train_x, train_y = MLDatasets.MNIST(split=:train)[:]
test_x, test_y = MLDatasets.MNIST(split=:test)[:]

# use the CUDA backend if available, otherwise use CPU()
backend, train_x, test_x, train_y, test_y =
try
CUDABackend(),
train_x |> cu,
test_x |> cu,
train_y |> cu,
test_y |> cu
catch
CPU(),
train_x,
test_x,
train_y,
test_y
end


#encoder layer - final layer has to be added for evaluation purposes!
model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection),
Classification(patch_length^2, 10, activation))

model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_connection=add_connection),
Classification(patch_length^2, 10, activation))

# trains the network for n_epochs epochs with the given optimizer and returns the per-epoch losses, the parameters, the total runtime and the test accuracy
function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer())
# call data loader
dl = DataLoader(train_x, train_y)
dl_test = DataLoader(test_x, test_y)
batch = Batch(batch_size)

ps = initialparameters(backend, eltype(dl.input), Ψᵉ)

optimizer_instance = Optimizer(opt, ps)

println("initial test accuracy: ", GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test), "\n")

progress_object = Progress(n_epochs; enabled=true)

# use the `time` function to get the system time.
init_time = time()
total_time = time() - init_time

loss_array = zeros(eltype(train_x), n_epochs)
for i in 1:n_epochs
loss_val = optimize_for_one_epoch!(optimizer_instance, Ψᵉ, ps, dl, batch)

ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])
loss_array[i] = loss_val

# update runtime
total_time = time() - init_time
end

accuracy_score = GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test)
println("final test accuracy: ", accuracy_score, "\n")

loss_array, ps, total_time, accuracy_score
end

loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs)
loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs)

#display(ps1.layer_1.PQ)
99 changes: 62 additions & 37 deletions scripts/transformer_new.jl
@@ -15,75 +15,100 @@ image_dim = 28
patch_length = 7
transformer_dim = 49
n_heads = 7
n_layers = 10
n_layers = 16
number_of_patch = (image_dim÷patch_length)^2
batch_size = 2048
activation = softmax
n_epochs = 1000
n_epochs = 500
add_connection = false
backend = CUDABackend()

train_x, train_y = MLDatasets.MNIST(split=:train)[:]
test_x, test_y = MLDatasets.MNIST(split=:test)[:]
if backend == CUDABackend()
train_x = train_x |> cu
test_x = test_x |> cu
train_y = train_y |> cu
test_y = test_y |> cu

# use the CUDA backend if available, otherwise use CPU()
backend, train_x, test_x, train_y, test_y =
try
CUDABackend(),
train_x |> cu,
test_x |> cu,
train_y |> cu,
test_y |> cu
catch
CPU(),
train_x,
test_x,
train_y,
test_y
end


#encoder layer - final layer has to be added for evaluation purposes!
model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection),
Classification(patch_length^2, 10, activation))

model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_connection=add_connection),
Classification(patch_length^2, 10, activation))


# trains the network for n_epochs epochs with the given optimizer and returns the per-epoch losses, the parameters, the total runtime and the test accuracy
function transformer_training(Ψᵉ::Chain; backend=CPU(), n_training_steps=10000, o=AdamOptimizer())
function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer())
# call data loader
dl = DataLoader(train_x, train_y, batch_size=batch_size)
dl_test = DataLoader(test_x, test_y, batch_size=length(test_y))
dl = DataLoader(train_x, train_y)
dl_test = DataLoader(test_x, test_y)
batch = Batch(batch_size)

ps = initialparameters(backend, eltype(dl.input), Ψᵉ)

ps = initialparameters(backend, eltype(dl.data), Ψᵉ)
optimizer_instance = Optimizer(opt, ps)

optimizer_instance = Optimizer(o, ps)
println("initial test accuracy: ", GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test), "\n")

println("initial test loss: ", loss(Ψᵉ, ps, dl_test), "\n")
progress_object = Progress(n_epochs; enabled=true)

progress_object = Progress(n_training_steps; enabled=true)
# use the `time` function to get the system time.
init_time = time()
total_time = time() - init_time

loss_array = zeros(eltype(train_x), n_training_steps)
for i in 1:n_training_steps
redraw_batch!(dl)
# ask Michael to take a look at this. Probably not good for performance.
loss_val, pb = Zygote.pullback(ps -> loss(Ψᵉ, ps, dl), ps)
dp = pb(one(loss_val))[1]
loss_array = zeros(eltype(train_x), n_epochs)
for i in 1:n_epochs
loss_val = optimize_for_one_epoch!(optimizer_instance, Ψᵉ, ps, dl, batch)

optimization_step!(optimizer_instance, Ψᵉ, ps, dp)
ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])
loss_array[i] = loss_val

# update runtime
total_time = time() - init_time
end

println("final test loss: ", loss(Ψᵉ, ps, dl_test), "\n")
accuracy_score = GeometricMachineLearning.accuracy(Ψᵉ, ps, dl_test)
println("final test accuracy: ", accuracy_score, "\n")

loss_array, ps
loss_array, ps, total_time, accuracy_score
end

# calculate number of epochs
n_training_steps = Int(ceil(length(train_y)*n_epochs/batch_size))

loss_array2, ps2 = transformer_training(model2, backend=backend, n_training_steps=n_training_steps)
loss_array1, ps1 = transformer_training(model1, backend=backend, n_training_steps=n_training_steps)
loss_array3, ps3 = transformer_training(model2, backend=backend, n_training_steps=n_training_steps, o=GradientOptimizer(0.001))
loss_array4, ps4 = transformer_training(model2, backend=backend, n_training_steps=n_training_steps, o=MomentumOptimizer(0.001, 0.5))
loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs)
loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs)
loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(0.001))
loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5))

p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4))
plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold")
p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2)
plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2)
png(p1, "Stiefel_Regular")

p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4))
plot!(p2, loss_array3, color=1, label="Gradient")
plot!(p2, loss_array4, color=3, label="Momentum")
png(p2, "Adam_Gradient_Momentum")
p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2)
plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2)
plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2)
png(p2, "Adam_Gradient_Momentum")

text_string =
"n_epochs: " * string(n_epochs) * "\n" *
"Regular weights: time: " * string(total_time1) * " classification accuracy: " * string(accuracy_score1) * "\n" *
"Stiefel weights: time: " * string(total_time2) * " classification accuracy: " * string(accuracy_score2) * "\n" *
"GradientOptimizer: time: " * string(total_time3) * " classification accuracy: " * string(accuracy_score3) * "\n" *
"MomentumOptimizer: time: " * string(total_time4) * " classification accuracy: " * string(accuracy_score4) * "\n"

display(text_string)

open("measure_times"*string(backend), "w") do file
write(file, text_string)
end
21 changes: 12 additions & 9 deletions src/GeometricMachineLearning.jl
@@ -67,12 +67,6 @@ module GeometricMachineLearning
include("kernels/kernel_ad_routines/vec_tensor_mul.jl")
# export tensor_mat_mul

include("data_loader/tensor_assign.jl")
include("data_loader/matrix_assign.jl")
include("data_loader/data_loader.jl")
include("data_loader/mnist_utils.jl")
export DataLoader, redraw_batch!, loss, onehotbatch, accuracy

# this defines empty retraction type structs (doesn't rely on anything)
include("optimizers/manifold_related/retraction_types.jl")

@@ -174,7 +168,7 @@
#INCLUDE ABSTRACT TRAINING integrator
export AbstractTrainingMethod

export loss_single, loss
export loss_single #, loss

export HnnTrainingMethod
export LnnTrainingMethod
@@ -264,14 +258,14 @@
# INCLUDE NEURALNET SOLUTION

export SingleHistory
export parameters, datashape, loss
export parameters, datashape
export History
export last, sizemax, nbtraining, show

include("nnsolution/history.jl")

export NeuralNetSolution
export nn, problem, tstep, loss, history, size_history
export nn, problem, tstep, history, size_history
export set_sizemax_history

include("nnsolution/neural_net_solution.jl")
@@ -356,6 +350,15 @@
export integrate, integrate_step!

include("integrator/sympnet_integrator.jl")

export DataLoader, onehotbatch, accuracy
export Batch, optimize_for_one_epoch!
include("data_loader/tensor_assign.jl")
include("data_loader/matrix_assign.jl")
include("data_loader/data_loader.jl")
include("data_loader/mnist_utils.jl")
include("data_loader/batch.jl")


include("reduced_system/system_type.jl")
include("reduced_system/reduced_system.jl")