
RoPE embeddings #30

Open
PRamoneda opened this issue Aug 13, 2024 · 1 comment
@PRamoneda

My conclusions about changing the positional encoding are that NoPE and ALiBi do not work well for encoder-only models because, unlike in decoder-only models, they do not capture position at all (the model remains permutation equivariant). However, RoPE (Rotary Position Embedding) seems promising: although it cannot extrapolate directly, it can be adapted to longer sequences with only about 1000 additional training steps. Even if it does not work perfectly, it provides relative positional encoding (we can see it as an improvement over sinusoidal positional encoding), which I believe makes a lot of sense for music. This is likely why the authors of Transformer++ used it. Additionally, RoPE seems to accelerate convergence and improve training stability, which is why even well-known decoder-only LLMs (e.g., LLaMA) use it; ALiBi can extrapolate, but it is very unstable during training.

We can borrow the code from https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
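
For reference, a minimal self-contained sketch of what that code does: rotate query and key channels by a position-dependent angle before attention, so the dot product depends only on relative position. This is my own illustrative re-implementation, not the linked repo's API; the tensor shapes and names are assumptions.

```python
# Sketch of rotary position embeddings (RoPE) applied to q/k before attention.
import torch


def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x with shape (batch, heads, seq_len, head_dim)."""
    b, h, n, d = x.shape
    assert d % 2 == 0, "head_dim must be even for RoPE"
    # Per-pair frequencies, the same schedule as sinusoidal positional encoding.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.einsum("n,f->nf", torch.arange(n, device=x.device).float(), inv_freq)
    cos, sin = angles.cos(), angles.sin()          # (seq_len, head_dim/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # split channels into pairs
    # 2D rotation of each (x1, x2) pair, then interleave back to head_dim.
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)


# Hypothetical usage inside encoder self-attention (values stay unrotated):
q = torch.randn(2, 8, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 128, 64)
q, k = rope_rotate(q), rope_rotate(k)
attn = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1)
```

In practice we would probably just wrap the linked rotary-embedding-torch package rather than maintain this by hand; the sketch is only to show where the rotation slots into an encoder block.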

@VarunGumma

Here is a relevant paper we recently wrote on the same topic: https://arxiv.org/abs/2408.11382
