A network architecture based on the attention mechanism.
Here, learned embeddings are used to convert the input tokens and output tokens to vectors of dimension d_model.
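A minimal sketch of this embedding step, assuming a PyTorch-style implementation (the class and parameter names are illustrative, not from the original; d_model = 512 and the sqrt(d_model) scaling of the embedding weights are the values used in the paper):

```python
import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Map token ids to learned vectors of dimension d_model."""
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) token ids -> (batch, seq_len, d_model) vectors;
        # the paper multiplies the embedding weights by sqrt(d_model)
        return self.embedding(token_ids) * math.sqrt(self.d_model)
```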
To provide some information about the relative or absolute position of the tokens in the sequence, "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks.
The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed.
Sine and cosine functions of different frequencies are used: within each encoding vector, the sine function fills the even indices and the cosine function fills the odd indices. The 10000^(2i/d_model) denominator is computed in log space for numerical stability.
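The encodings from the paper are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal PyTorch-style sketch that computes the denominator in log space (the function name is illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int = 512) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    # Denominator in log space: 1 / 10000^(2i/d_model) = exp(-(2i / d_model) * ln(10000))
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even indices
    pe[:, 1::2] = torch.cos(position * div_term)   # odd indices
    return pe   # (seq_len, d_model); added to the input embeddings
```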
This is layer normalization: the mean and variance are computed independently for each item over its feature dimension, and each value is normalized using them. Learnable parameters gamma (a multiplicative scale) and beta (an additive shift) are then applied so the network can adjust the normalized values.
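A minimal PyTorch-style sketch of this layer normalization with learnable gamma and beta (names are illustrative; the eps term is an assumption added for numerical stability):

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    """Normalize each item over its feature dimension, then scale and shift."""
    def __init__(self, d_model: int = 512, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); statistics per position, not per batch
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
```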
Each of the layers in the encoder and decoder contains a fully connected feed-forward network. This consists of two linear transformations with a ReLU activation in between.
The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.
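This is FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at each position. A minimal PyTorch-style sketch (class name illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear transformations with a ReLU in between, applied per position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # d_ff -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(torch.relu(self.linear1(x)))
```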
Multi-head attention takes the output of the positional encoding step and uses it three times, as the query, key, and value inputs. Each of these is multiplied by a learned projection matrix (W^Q, W^K, W^V), and the resulting matrices are split into h smaller matrices along the embedding dimension. Attention is then applied to each of the split matrices, the resulting matrices are concatenated, and the concatenation is multiplied by an output projection matrix W^O. In this work h = 8 heads are used.
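A minimal PyTorch-style sketch of multi-head attention under these values (d_model = 512 and h = 8, so each head works on d_k = d_model / h = 64 dimensions; names are illustrative):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project q/k/v, split into h heads along the embedding dimension,
    apply scaled dot-product attention per head, concatenate, project with W^O."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads"
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, q, k, v, mask=None):
        batch = q.shape[0]

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
            return x.view(batch, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split_heads(self.w_q(q)), split_heads(self.w_k(k)), split_heads(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (batch, h, seq_q, seq_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ v                # (batch, h, seq_q, d_k)
        # Concatenate the heads back along the embedding dimension and apply W^O
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(concat)
```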
The encoder is composed of a stack of N = 6 identical layers.
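A minimal sketch of how these pieces might be stacked, assuming the sub-layer arrangement of the original paper (each layer applies self-attention and then the feed-forward network, each wrapped in a residual connection followed by layer normalization); it reuses the MultiHeadAttention, FeedForward, and LayerNormalization classes sketched above:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention and feed-forward sub-layers, each wrapped in a
    residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNormalization(d_model)
        self.norm2 = LayerNormalization(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.self_attn(x, x, x, mask))   # residual + layer norm
        return self.norm2(x + self.ffn(x))                  # residual + layer norm

class Encoder(nn.Module):
    """Stack of N identical encoder layers."""
    def __init__(self, N: int = 6, **layer_kwargs):
        super().__init__()
        self.layers = nn.ModuleList([EncoderLayer(**layer_kwargs) for _ in range(N)])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x
```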
The decoder is also composed of a stack of N = 6 identical layers.
The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
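A minimal sketch of that mask: position i may attend only to positions up to and including i, so everything strictly above the diagonal is masked out (function name illustrative; the result can be passed as the mask argument in the attention sketch above):

```python
import torch

def subsequent_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask of shape (1, seq_len, seq_len); True where attention is allowed."""
    # Lower-triangular matrix: position i can see positions 0..i, never i+1 onward
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)).unsqueeze(0)

# Example for seq_len = 4:
# [[[ True, False, False, False],
#   [ True,  True, False, False],
#   [ True,  True,  True, False],
#   [ True,  True,  True,  True]]]
```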