diff --git a/docs/make.jl b/docs/make.jl
index fdbfb0657..f6e7ebf49 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -92,6 +92,7 @@ makedocs(;
"Linear Wave Equation" => "tutorials/linear_wave_equation.md",
"MNIST" => "tutorials/mnist_tutorial.md",
"Grassmann manifold" => "tutorials/grassmann_layer.md",
+"Volume-Preserving Attention" => "tutorials/volume_preserving_attention.md",
],
"References" => "references.md",
"Library" => "library.md",
diff --git a/docs/src/GeometricMachineLearning.bib b/docs/src/GeometricMachineLearning.bib
index f02a7fb06..c342eb5e1 100644
--- a/docs/src/GeometricMachineLearning.bib
+++ b/docs/src/GeometricMachineLearning.bib
@@ -264,4 +264,13 @@ @book{jacobs1992discrete
publisher={Birkh{\"a}user Verlag},
address={Basel, Switzerland},
year={1992}
+}
+
+@article{feng1998step,
+ title={The step-transition operators for multi-step methods of ODE's},
+ author={Feng, Kang},
+ journal={Journal of Computational Mathematics},
+ pages={193--202},
+ year={1998},
+ publisher={JSTOR}
}
\ No newline at end of file
diff --git a/docs/src/layers/attention_layer.md b/docs/src/layers/attention_layer.md
index fce2da1a5..dc4d68bd8 100644
--- a/docs/src/layers/attention_layer.md
+++ b/docs/src/layers/attention_layer.md
@@ -1,86 +1,156 @@
# The Attention Layer
-The *attention* mechanism was originally applied for image and natural language processing (NLP) tasks. In (Bahdanau et al, 2014) ``additive'' attention is used:
+The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. That is, given sequences
```math
-(z_q, z_k) \mapsto v^T\sigma(Wz_q + Uz_k).
+(z_q^{(1)}, z_q^{(2)}, \ldots, z_q^{(T)}) \text{ and } (z_k^{(1)}, z_k^{(2)}, \ldots, z_k^{(T)}),
```
+an attention mechanism computes pair-wise correlations between all combinations of two input vectors from these sequences. In [bahdanau2014neural](@cite) "additive" attention is used to compute such correlations:
-However ``multiplicative'' attention is more straightforward to interpret and cheaper to handle computationally:
+[^1]: *Recurrent neural networks* have the same motivation.
```math
-(z_q, z_k) \mapsto z_q^TWz_k.
+(z_q, z_k) \mapsto v^T\sigma(Wz_q + Uz_k),
```
-Regardless of the type of attention used, they all try to compute correlations among input sequences on whose basis further neural network-based computation is performed. So given two input sequences $(z_q^{(1)}, \ldots, z_q^{(T)})$ and $(z_k^{(1)}, \ldots, z_k^{(T)})$, various attention mechanisms always return an output $C\in\mathbb{R}^{T\times{}T}$ with entries $[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)}$.
+where ``z_q, z_k \in \mathbb{R}^d`` are elements of the input sequences. The learnable parameters are ``W, U \in \mathbb{R}^{n\times{}d}`` and ``v \in \mathbb{R}^n``.
-# Self Attention
+However *multiplicative attention* (see e.g. [vaswani2017attention](@cite)) is more straightforward to interpret and cheaper to handle computationally:
+```math
+(z_q, z_k) \mapsto z_q^TWz_k,
+```
+
+where ``W \in \mathbb{R}^{d\times{}d}`` is a learnable weight matrix with respect to which correlations are computed as scalar products. Regardless of the type of attention used, all of these mechanisms compute correlations among the input sequences, on whose basis further computation is performed.
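To make the two score functions concrete, here is a minimal sketch in plain Julia. It is not part of the library: the weights are random placeholders and `tanh` stands in for the generic nonlinearity ``\sigma``.

```julia
using LinearAlgebra: dot

d, n = 4, 8                              # feature dimension d, hidden dimension n
z_q, z_k = rand(d), rand(d)              # one query and one key vector

# additive attention: learnable W, U of size n×d and v of length n
W_add, U_add, v = rand(n, d), rand(n, d), rand(n)
score_additive = dot(v, tanh.(W_add * z_q + U_add * z_k))

# multiplicative attention: a single learnable W of size d×d
W_mul = rand(d, d)
score_multiplicative = dot(z_q, W_mul * z_k)
```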
Given two input sequences ``Z_q = (z_q^{(1)}, \ldots, z_q^{(T)})`` and ``Z_k = (z_k^{(1)}, \ldots, z_k^{(T)})``, we can arrange the various correlations into a *correlation matrix* ``C\in\mathbb{R}^{T\times{}T}`` with entries ``[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})``. In the case of multiplicative attention this matrix is just ``C = Z_q^TWZ_k``.
+## Reweighting of the input sequence
+In `GeometricMachineLearning` we always compute *self-attention*, meaning that the two input sequences ``Z_q`` and ``Z_k`` are the same, i.e. ``Z = Z_q = Z_k``.[^2]
-## Attention in `GeometricMachineLearning`
+[^2]: [Multihead attention](multihead_attention_layer.md) also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K_i Z)``.
-The attention layer (and the *orthonormal activation* function defined for it) in `GeometricMachineLearning` was specifically designed to generalize transformers to symplectic data.
-Usually a self-attention layer takes the following form:
+The correlation matrix ``C`` is then used to reweight the columns in the input sequence ``Z``. For this we first apply a nonlinearity ``\sigma`` onto ``C`` and then multiply ``\sigma(C)`` onto ``Z`` from the right, i.e. the output of the attention layer is ``Z\sigma(C)``. So we perform the following mappings:
```math
-Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}((P^QZ)^T(P^KZ)),
+Z \xrightarrow{\mathrm{correlations}} C(Z) =: C \xrightarrow{\sigma} \sigma(C) \xrightarrow{\text{right multiplication}} Z \sigma(C).
```
-where we left out the linear mapping onto the values $P^V$.
-The idea behind is that we can perform a non-linear re-weighting of the columns of $Z$ by multiplying with a $Z$-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with normal neural networks). After the attention step the transformer applies a simple ResNet from the left.
-What the softmax does is a vector-wise operation, i.e. it operates on each column of an input matrix $A = [a_1, \ldots, a_T]$. The result is a sequence of probability vectors $[p^{(1)}, \ldots, p^{(T)}]$ for which
+After the right multiplication the output is of the following form:
+
+```math
+ [\sum_{i=1}^Tp^{(1)}_iz^{(i)}, \ldots, \sum_{i=1}^Tp^{(T)}_iz^{(i)}],
+```
+for ``p^{(i)} = [\sigma(C)]_{\bullet{}i}``. What is *learned* during training are ``T`` different linear combinations of the input vectors, where the coefficients ``p^{(i)}_j`` in these linear combinations depend nonlinearly on the input ``Z``.
+
+## `VolumePreservingAttention` in `GeometricMachineLearning`
+
+The attention layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to be applied to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
+Traditionally the nonlinearity in the attention mechanism is a softmax[^3] (see [vaswani2017attention](@cite)) and the self-attention layer performs the following mapping:
+
+[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[p^{(1)}, \ldots, p^{(T)}]`` for which ``\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``
```math
-\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.
+Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}(Z^TWZ).
```
-What we want to construct is a symplectic transformation that is *transformer-like*. For this we modify the attention layer the following way:
+The softmax activation acts vector-wise, i.e. if we supply it with a matrix ``C`` as input it returns:
+
+```math
+\mathrm{softmax}(C) = [\mathrm{softmax}(c_{\bullet{}1}), \ldots, \mathrm{softmax}(c_{\bullet{}T})].
+```
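The following is a small, self-contained illustration of this reweighting in plain Julia. The hand-written `softmax` and the random weighting `W` are placeholders for illustration only; this is not the implementation used in the library.

```julia
# column-wise softmax, written out by hand for clarity
softmax(c::AbstractVector) = exp.(c) ./ sum(exp.(c))
softmax(C::AbstractMatrix) = reduce(hcat, [softmax(c) for c in eachcol(C)])

d, T = 4, 6
Z = rand(d, T)        # input sequence: T vectors of dimension d
W = rand(d, d)        # weighting for the correlations (random placeholder)

C = Z' * W * Z        # correlation matrix of size T×T
P = softmax(C)        # every column of P is a probability vector
sum(P; dims = 1)      # each column sums to one

output = Z * P        # column j of the output is the linear combination of the columns of Z with coefficients P[:, j]
```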
+The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``P = [p^{(1)}, \ldots, p^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are computed independently of each other, and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical, thus leading to a loss of information. So the softmax activation function is inherently non-geometric.
+
+Besides the traditional attention mechanism, `GeometricMachineLearning` therefore also offers a volume-preserving transformation that fulfills a similar role. Two approaches are implemented to realize such a transformation. Both of them utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent, which is not necessarily true for a stochastic matrix ``P``. The following explains how this new activation function is implemented.
+
+### The Cayley transform
+
+The Cayley transform maps from skew-symmetric matrices to orthonormal matrices[^4]. It takes the form:
+
+[^4]: A matrix ``A`` is skew-symmetric if ``A = -A^T`` and a matrix ``B`` is orthonormal if ``B^TB = \mathbb{I}``. The orthonormal matrices form a Lie group, i.e. the set of orthonormal matrices can be endowed with the structure of a differential manifold and this set also satisfies the group axioms. The corresponding Lie algebra is the space of skew-symmetric matrices, and the Cayley transform is a so-called retraction in this case. For more details consult e.g. [hairer2006geometric](@cite) and [absil2008optimization](@cite).
```math
-Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma((P^QZ)^T(P^KZ)),
+\mathrm{Cayley}: A \mapsto (\mathbb{I} - A)(\mathbb{I} + A)^{-1}.
```
-where $\sigma(A)=\exp(\mathtt{upper\_triangular{\_asymmetrize}}(A))$ and
+
+We can easily check that ``\mathrm{Cayley}(A)`` is orthogonal if ``A`` is skew-symmetric. Using ``A^T = -A`` and the fact that ``(\mathbb{I} + A)`` and ``(\mathbb{I} - A)`` commute, we get:
+
+```math
+\mathrm{Cayley}(A)^T\mathrm{Cayley}(A) = (\mathbb{I} + A)^{-T}(\mathbb{I} - A)^T(\mathbb{I} - A)(\mathbb{I} + A)^{-1} = (\mathbb{I} - A)^{-1}(\mathbb{I} - A)(\mathbb{I} + A)(\mathbb{I} + A)^{-1} = \mathbb{I}.
+```
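A quick numerical sanity check of this orthogonality property can be done in plain Julia. The helper `cayley` below is defined only for this illustration and is not the library's implementation.

```julia
using LinearAlgebra

# the Cayley transform as written above
cayley(A::AbstractMatrix) = (I - A) * inv(I + A)

B = rand(4, 4)
A = B - B'                          # skew-symmetric: A' == -A

Σ = cayley(A)
Σ' * Σ ≈ Matrix{Float64}(I, 4, 4)   # true up to floating-point error
```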
+In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar product with a skew-symmetric weighting and via a scalar product with an arbitrary weighting.
+
+### First approach: scalar products with a skew-symmetric weighting
+
+For this the attention layer is modified in the following way:
+
+```math
+Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma(Z^TAZ),
+```
+where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a skew-symmetric matrix that is learnable, i.e. the parameters of the attention layer are stored in ``A``.
+
+### Second approach: scalar products with an arbitrary weighting
+
+For this approach we compute correlations between the input vectors with an arbitrary learnable matrix ``A``; here the skew-symmetry is enforced by the way the correlations are arranged, not by ``A`` itself. The correlations we consider are based on:
```math
-[\mathtt{upper\_triangular\_asymmetrize}(A)]_{ij} = \begin{cases} a_{ij} & \text{if $ij$} \\ 0 & \text{else.}\end{cases}
+(z^{(2)})^TAz^{(1)}, (z^{(3)})^TAz^{(1)}, \ldots, (z^{(T)})^TAz^{(1)}, (z^{(3)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(T-1)}.
```
-This has as a consequence that the matrix $\Lambda(Z) := \sigma((P^QZ)^T(P^KZ))$ is orthonormal and hence preserves an *extended symplectic structure*. To make this more clear, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. $V\times\cdots\times{}V \ni [z^1, \ldots, z^T] \mapsto [\hat{z}^1, \ldots, \hat{z}^T]$. We can define a symplectic structure on $V\times\cdots\times{}V$ by rearranging $[z^1, \ldots, z^T]$ into a vector. We do this in the following way:
+So in total we consider correlations ``(z^{(i)})^TAz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:
```math
-\tilde{Z} = \begin{pmatrix} q^{(1)}_1 \\ q^{(2)}_1 \\ \cdots \\ q^{(T)}_1 \\ q^{(1)}_2 \\ \cdots \\ q^{(T)}_d \\ p^{(1)}_1 \\ p^{(2)}_1 \\ \cdots \\ p^{(T)}_1 \\ p^{(1)}_2 \\ \cdots \\ p^{(T)}_d \end{pmatrix}.
+C = \begin{bmatrix}
+    0 & -(z^{(2)})^TAz^{(1)} & -(z^{(3)})^TAz^{(1)} & \ldots & -(z^{(T)})^TAz^{(1)} \\
+    (z^{(2)})^TAz^{(1)} & 0 & -(z^{(3)})^TAz^{(2)} & \ldots & -(z^{(T)})^TAz^{(2)} \\
+    \ldots & \ldots & \ldots & \ldots & \ldots \\
+    (z^{(T)})^TAz^{(1)} & (z^{(T)})^TAz^{(2)} & (z^{(T)})^TAz^{(3)} & \ldots & 0
+\end{bmatrix}.
```
-The symplectic structure on this big space is then:
+This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix.
+
+## How is structure preserved?
+
+In order to discuss *how structure is preserved* we first have to define what *structure* we mean precisely. This structure is strongly inspired by traditional *multi-step methods* (see [feng1998step](@cite)). We now define what volume preservation means for the product space ``\mathbb{R}^{d}\times\cdots\times\mathbb{R}^{d}\equiv\times_\text{$T$ times}\mathbb{R}^{d}``.
+
+Consider an isomorphism ``\hat{}: \times_\text{($T$ times)}\mathbb{R}^{d}\stackrel{\approx}{\longrightarrow}\mathbb{R}^{dT}``. Specifically, this isomorphism takes the form:
```math
-\mathbb{J}=\begin{pmatrix}
-    \mathbb{O}_{dT} & \mathbb{I}_{dT} \\
-    -\mathbb{I}_{dT} & \mathbb{O}_{dT}
-\end{pmatrix}.
+Z = \left[\begin{array}{cccc}
+    z_1^{(1)} & z_1^{(2)} & \quad\cdots\quad & z_1^{(T)} \\
+    z_2^{(1)} & z_2^{(2)} & \cdots & z_2^{(T)} \\
+    \cdots & \cdots & \cdots & \cdots \\
+    z_d^{(1)} & z_d^{(2)} & \cdots & z_d^{(T)}
+    \end{array}\right] \mapsto
+    \left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec}.
```
-Multiplying with the matrix $\Lambda(Z)$ from the right onto $[z^1, \ldots, z^T]$ corresponds to applying the sparse matrix
+We refer to the inverse of ``Z \mapsto \hat{Z}`` as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.
+
+__DEFINITION__:
+We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated mapping ``\hat{\varphi}: \mathbb{R}^{dT} \to \mathbb{R}^{dT}`` is volume-preserving, i.e. if the determinant of the Jacobian of ``\hat{\varphi}`` has absolute value one.
+
+Multiplying ``Z`` from the right with ``\Lambda(Z) := \sigma(C(Z))``, i.e. applying the attention layer, is, in the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}`` defined above), equivalent to multiplying ``Z_\mathrm{vec}`` from the left with a sparse matrix ``\tilde\Lambda(Z)``:
```math
-\tilde{\Lambda}(Z)=\left[
-\begin{array}{ccc}
-    \Lambda(Z) & \cdots & \mathbb{O}_T \\
-    \vdots & \ddots & \vdots \\
-    \mathbb{O}_T & \cdots & \Lambda(Z)
-    \end{array}
-\right]
+    \tilde{\Lambda}(Z) Z_\mathrm{vec} :=
+    \begin{pmatrix}
+    \Lambda(Z)^T & \mathbb{O} & \cdots & \mathbb{O} \\
+    \mathbb{O} & \Lambda(Z)^T & \cdots & \mathbb{O} \\
+    \cdots & \cdots & \ddots & \cdots \\
+    \mathbb{O} & \mathbb{O} & \cdots & \Lambda(Z)^T \\
+    \end{pmatrix}
+    \left[\begin{array}{c}  z_1^{(1)} \\ z_1^{(2)} \\ \ldots \\ z_1^{(T)} \\ z_2^{(1)} \\ \ldots \\ z_d^{(T)} \end{array}\right] .
```
-from the left onto the big vector.
+The transpose appears because ``Z_\mathrm{vec}`` collects the *rows* of ``Z``. The matrix ``\tilde{\Lambda}(Z)`` in the equation above is easily shown to be orthogonal, since each of its diagonal blocks is orthogonal; its determinant is therefore ``\pm1`` and the map is indeed volume-preserving.

## Historical Note

-Attention was used before, but always in connection with **recurrent neural networks** (see (Luong et al, 2015) and (Bahdanau et al, 2014)).
+Attention was used before the transformer era as well, but always in connection with **recurrent neural networks** (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).

## References
diff --git a/docs/src/tutorials/volume_preserving_attention.md b/docs/src/tutorials/volume_preserving_attention.md
new file mode 100644
index 000000000..ddab31bf6
--- /dev/null
+++ b/docs/src/tutorials/volume_preserving_attention.md
@@ -0,0 +1,189 @@
+# Comparison of different `VolumePreservingAttention`
+
+In the [section on volume-preserving attention](../layers/attention_layer.md) we described two ways of computing volume-preserving attention: one where we compute the correlations with a skew-symmetric matrix and one where we compute the correlations with an arbitrary matrix. Here we compare the two approaches. When calling the `VolumePreservingAttention` layer we can specify whether we want to use the skew-symmetric or the arbitrary weighting by setting the keyword `skew_sym = true` or `skew_sym = false`, respectively.
+
+To demonstrate the differences between the two approaches for computing correlations, we first generate a training set consisting of two curves: (i) a sine curve and (ii) a cosine curve.
+
+```@example volume_preserving_attention
using GeometricMachineLearning # hide
using GeometricMachineLearning: FeedForwardLoss, TransformerLoss # hide
using Plots # hide
import Random # hide
Random.seed!(123) # hide

sine_cosine = zeros(1, 1000, 2)
sine_cosine[1, :, 1] .= sin.(0.:.1:99.9)
sine_cosine[1, :, 2] .= cos.(0.:.1:99.9)

const dl = DataLoader(Float16.(sine_cosine))
```

The third axis (i.e. the parameter axis) has length two, meaning we have two different kinds of curves:

```@example volume_preserving_attention
plot(dl.input[1, :, 1], label = "sine")
plot!(dl.input[1, :, 2], label = "cosine")
```

We want to train a single neural network on both of these curves.
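Before setting up the networks we can quickly check how the data are laid out (a plain `julia` sketch, not executed as part of the tutorial; the expected sizes follow from the construction of `sine_cosine` above):

```julia
size(dl.input)         # expected (1, 1000, 2): system dimension × time steps × parameter axis
dl.input[1, 1:3, 1]    # first few values of the sine curve
dl.input[1, 1:3, 2]    # first few values of the cosine curve
```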
We compare three networks which are of the following form:

```math
\mathtt{network} = \mathcal{NN}_d\circ\Psi\circ\mathcal{NN}_u,
```

where ``\mathcal{NN}_u`` refers to a neural network that scales up and ``\mathcal{NN}_d`` refers to a neural network that scales down. The up- and down-scaling is done with simple dense layers:

```math
\mathcal{NN}_u(x) = \mathrm{tanh}(a_ux + b_u) \text{ and } \mathcal{NN}_d(x) = a_d^Tx + b_d,
```
where ``a_u, b_u, a_d\in\mathbb{R}^\mathrm{ud}`` and ``b_d`` is a scalar. `ud` refers to the *upscaling dimension*. For ``\Psi`` we consider three different choices:
1. a volume-preserving attention with skew-symmetric weighting,
2. a volume-preserving attention with arbitrary weighting,
3. an identity layer.

We further choose a sequence length of 3 (i.e. the network always sees the last 3 time steps) and always predict one step into the future (i.e. the prediction window is set to 1):

```@example volume_preserving_attention
const seq_length = 3
const prediction_window = 1

const upscale_dimension_1 = 2

const T = Float16

function set_up_networks(upscale_dimension::Int = upscale_dimension_1)
    model_skew = Chain(Dense(1, upscale_dimension, tanh), VolumePreservingAttention(upscale_dimension, seq_length; skew_sym = true), Dense(upscale_dimension, 1, identity; use_bias = true))
    model_arb = Chain(Dense(1, upscale_dimension, tanh), VolumePreservingAttention(upscale_dimension, seq_length; skew_sym = false), Dense(upscale_dimension, 1, identity; use_bias = true))
    model_comp = Chain(Dense(1, upscale_dimension, tanh), Dense(upscale_dimension, 1, identity; use_bias = true))

    nn_skew = NeuralNetwork(model_skew, CPU(), T)
    nn_arb = NeuralNetwork(model_arb, CPU(), T)
    nn_comp = NeuralNetwork(model_comp, CPU(), T)

    nn_skew, nn_arb, nn_comp
end

nn_skew, nn_arb, nn_comp = set_up_networks()
```

We expect the third network not to be able to learn anything useful since it cannot resolve time series data: a regular feedforward network only ever sees one datum at a time.
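To make precise what the two attention-based networks see during training, here is one hand-built training sample (a sketch using plain indexing; the actual batching is done by `Batch` below and may differ in its details):

```julia
# one sample for sequence length 3 and prediction window 1: the network reads three
# consecutive values of a curve and is asked to predict the value that follows.
input_window = dl.input[:, 1:3, 1]   # time steps 1, 2, 3 of the sine curve
target_value = dl.input[:, 4:4, 1]   # time step 4 of the sine curve
```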
+ +Next we train the networks (here we pick a batch size of 30): + +```@example volume_preserving_attention +function set_up_optimizers(nn_skew, nn_arb, nn_comp) + o_skew = Optimizer(AdamOptimizer(T), nn_skew) + o_arb = Optimizer(AdamOptimizer(T), nn_arb) + o_comp = Optimizer(AdamOptimizer(T), nn_comp) + + o_skew, o_arb, o_comp +end + +o_skew, o_arb, o_comp = set_up_optimizers(nn_skew, nn_arb, nn_comp) + +const n_epochs = 1000 + +const batch_size = 30 + +const batch = Batch(batch_size, seq_length, prediction_window) +const batch2 = Batch(batch_size) + +function train_networks!(nn_skew, nn_arb, nn_comp) + loss_array_skew = o_skew(nn_skew, dl, batch, n_epochs, TransformerLoss(batch)) + loss_array_arb = o_arb( nn_arb, dl, batch, n_epochs, TransformerLoss(batch)) + loss_array_comp = o_comp(nn_comp, dl, batch2, n_epochs, FeedForwardLoss()) + + loss_array_skew, loss_array_arb, loss_array_comp +end + +loss_array_skew, loss_array_arb, loss_array_comp = train_networks!(nn_skew, nn_arb, nn_comp) + +function plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp) + p = plot(loss_array_skew, color = 2, label = "skew", yaxis = :log) + plot!(p, loss_array_arb, color = 3, label = "arb") + plot!(p, loss_array_comp, color = 4, label = "comp") + + p +end + +plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp) +``` + +Looking at the training errors, we can see that the network with the skew-symmetric weighting is stuck at a relatively high error rate, whereas the loss for the network with the arbitrary weighting is decreasing to a significantly lower level. The feedforward network without the attention mechanism is not able to learn anything useful (as was expected). + +The following demonstrates the predictions of our approaches[^1]: + +[^1]: Here we have to use the architectures `DummyTransformer` and `DummyNNIntegrator` to reformulate the three neural networks defined here as `NeuralNetworkIntegrator`s. Normally the user should try to use predefined architectures in `GeometricMachineLearning`, that way they never use `DummyTransformer` and `DummyNNIntegrator`. + +```@example volume_preserving_attention +initial_condition = dl.input[:, 1:seq_length, 2] + +function make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp) + nn_skew = NeuralNetwork(GeometricMachineLearning.DummyTransformer(seq_length), nn_skew.model, nn_skew.params) + nn_arb = NeuralNetwork(GeometricMachineLearning.DummyTransformer(seq_length), nn_arb.model, nn_arb.params) + nn_comp = NeuralNetwork(GeometricMachineLearning.DummyNNIntegrator(), nn_comp.model, nn_comp.params) + + nn_skew, nn_arb, nn_comp +end + +nn_skew, nn_arb, nn_comp = make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp) + +function produce_validation_plot(n_points::Int, nn_skew = nn_skew, nn_arb = nn_arb, nn_comp = nn_comp; initial_condition::Matrix=initial_condition, type = :cos) + validation_skew = iterate(nn_skew, initial_condition; n_points = n_points, prediction_window = 1) + validation_arb = iterate(nn_arb, initial_condition; n_points = n_points, prediction_window = 1) + validation_comp = iterate(nn_comp, initial_condition[:, 1]; n_points = n_points) + + p2 = type == :cos ? 
plot(dl.input[1, 1:n_points, 2], color = 1, label = "reference") : plot(dl.input[1, 1:n_points, 1], color = 1, label = "reference")

    plot!(p2, validation_skew[1, :], color = 2, label = "skew")
    plot!(p2, validation_arb[1, :], color = 3, label = "arb")
    plot!(p2, validation_comp[1, :], color = 4, label = "comp")
    vline!([seq_length], color = :red, label = "start of prediction")

    p2
end

p2 = produce_validation_plot(40)
```

In the above plot we can see that the network with the arbitrary weighting performs much better; even though the green line does not fit the blue line very well either, it manages to at least qualitatively reflect the training data. We can also plot the predictions for longer time intervals:

```@example volume_preserving_attention
p3 = produce_validation_plot(400)
```

We can also plot the comparison with the sine function:

```@example volume_preserving_attention
initial_condition = dl.input[:, 1:seq_length, 1]

p2 = produce_validation_plot(40, initial_condition = initial_condition, type = :sin)
```

This advantage of the volume-preserving attention with arbitrary weighting may however be due to the fact that the skew-symmetric attention only has 3 learnable parameters, as opposed to 9 for the arbitrary weighting. If we increase the *upscaling dimension* the result changes:

```@example volume_preserving_attention
const upscale_dimension_2 = 10

nn_skew, nn_arb, nn_comp = set_up_networks(upscale_dimension_2)

o_skew, o_arb, o_comp = set_up_optimizers(nn_skew, nn_arb, nn_comp)

loss_array_skew, loss_array_arb, loss_array_comp = train_networks!(nn_skew, nn_arb, nn_comp)

plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)
```

```@example volume_preserving_attention
initial_condition = dl.input[:, 1:seq_length, 2]

nn_skew, nn_arb, nn_comp = make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)

p2 = produce_validation_plot(40, nn_skew, nn_arb, nn_comp)
```

And for a longer time interval:

```@example volume_preserving_attention
p3 = produce_validation_plot(200, nn_skew, nn_arb, nn_comp)
```
\ No newline at end of file
diff --git a/src/layers/resnet.jl b/src/layers/resnet.jl
index 7e815b7ba..4d981887b 100644
--- a/src/layers/resnet.jl
+++ b/src/layers/resnet.jl
@@ -46,5 +46,5 @@ end
end

@inline function (d::Dense{M, N, false})(x::AbstractArray{T, 3}, ps::NamedTuple) where {M, N, T}
-    return d.σ.(mat_tensor_mult(ps.W, x))
+    return d.σ.(mat_tensor_mul(ps.W, x))
end
diff --git a/test/attention_layer/attention_setup.jl b/test/attention_layer/attention_setup.jl
index efd00d6c4..b43a8c66b 100644
--- a/test/attention_layer/attention_setup.jl
+++ b/test/attention_layer/attention_setup.jl
@@ -5,8 +5,8 @@ import Random
Random.seed!(1234)

function volume_preserving_attention_tests(N, T=Float32)
-    model₁ = VolumePreservingAttention(N, skew_sym = false)
-    model₂ = VolumePreservingAttention(N, skew_sym = true)
+    model₁ = VolumePreservingAttention(N, N, skew_sym = false)
+    model₂ = VolumePreservingAttention(N, N, skew_sym = true)

    ps₁ = initialparameters(model₁, CPU(), T)
    ps₂ = initialparameters(model₂, CPU(), T)
@@ -22,12 +22,16 @@
    @test det₂ ≈ det₃
end

-# this checks the cpu version
-volume_preserving_attention_tests(10)
+function check_all(T)
+    # this checks the cpu version
+    volume_preserving_attention_tests(10, T)

-# this checks the "gpu versions"
-volume_preserving_attention_tests(2)
-volume_preserving_attention_tests(3)
-volume_preserving_attention_tests(4)
-volume_preserving_attention_tests(5)
-volume_preserving_attention_tests(6)
\ No newline at end of file
+    # this checks the "gpu versions"
+    volume_preserving_attention_tests(2, T)
+    volume_preserving_attention_tests(3, T)
+    volume_preserving_attention_tests(4, T)
+end
+
+check_all(Float16)
+check_all(Float32)
+check_all(Float64)
\ No newline at end of file
diff --git a/test/kernels/tensor_inverse.jl b/test/kernels/tensor_inverse.jl
index f130126ab..04c6633b4 100644
--- a/test/kernels/tensor_inverse.jl
+++ b/test/kernels/tensor_inverse.jl
@@ -1,4 +1,4 @@
-using GeometricMachineLearning: tensor_inverse2, tensor_inverse3, tensor_inverse4, tensor_inverse5
+using GeometricMachineLearning: tensor_inverse2, tensor_inverse3, tensor_inverse4, tensor_inverse5, cpu_inverse
using Test
import Zygote
@@ -98,7 +98,6 @@ end

test22_inverse()

-

function test22_inverse_pullback(k::Int = 10)
    A = rand(2, 2, k)

@@ -114,4 +113,21 @@ end

-test22_inverse_pullback()
\ No newline at end of file
+test22_inverse_pullback()
+
+# compare the batched cpu_inverse and its pullback against inv applied to each slice
+function test_cpu_inverse_pullback(k::Int = 10)
+    A = rand(3, 3, k)
+
+    pullback_total = Zygote.pullback(cpu_inverse, A)
+
+    out_diff = rand(3, 3, k)
+
+    for i = 1:k
+        pullback_i = Zygote.pullback(inv, A[:, :, i])
+
+        @test pullback_total[1][:, :, i] ≈ pullback_i[1]
+        @test pullback_total[2](out_diff)[1][:, :, i] ≈ pullback_i[2](out_diff[:, :, i])[1]
+    end
+end
+
+test_cpu_inverse_pullback()
\ No newline at end of file
diff --git a/test/optimizers/optimizer_convergence_tests/adam_with_learning_rate_decay.jl b/test/optimizers/optimizer_convergence_tests/adam_with_learning_rate_decay.jl
index 7b7326237..22a197c42 100644
--- a/test/optimizers/optimizer_convergence_tests/adam_with_learning_rate_decay.jl
+++ b/test/optimizers/optimizer_convergence_tests/adam_with_learning_rate_decay.jl
@@ -1,5 +1,6 @@
using GeometricMachineLearning
+using GeometricMachineLearning: ResNetLayer
using Test
import Random