
Some MHA and RoPE refactoring, llama-3.1 rope_scaling #91

Merged
merged 9 commits into main on Aug 30, 2024

Conversation

francoishernandez
Member

Notes:

  • the RoPE refactoring was probably not strictly necessary, but it makes things clearer IMO;
  • models using rotary embeddings will need to be reconverted, because all rope-related settings were moved to a sub-config for clarity;
  • this is not extensively tested, but it seems to work fine (tested on the example prompt here, for instance);
  • our RoPE implementation is not strictly equivalent to the HF one: we rely on some of the "original" rope tricks based on complex-space computation (.polar/.real/.imag), whereas HF just applies .sin()/.cos(), which is not numerically equivalent (see the sketch after this list);
  • additional scaling types can be implemented (e.g. taking inspiration around here... https://github.com/huggingface/transformers/blob/f1a385b1de7e83e2be9b087d1c0646c0c426e2fc/src/transformers/modeling_rope_utils.py)
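
To illustrate the point above, here is a minimal, self-contained sketch (not the actual code from this PR; function names and shapes are illustrative only) of the two ways of applying rotary embeddings to interleaved pairs: via complex multiplication built with torch.polar, and via explicit cos/sin. The two express the same rotation mathematically, but floating-point round-off can differ slightly, and HF additionally uses a non-interleaved ("rotate half") layout.

```python
import torch

def rope_complex(x, theta=1e4):
    # x: (batch, seq, heads, dim), dim even; rotate interleaved pairs in complex space
    b, s, h, d = x.shape
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))
    angles = torch.outer(torch.arange(s).float(), inv_freq)          # (s, d/2)
    rope = torch.polar(torch.ones_like(angles), angles)              # exp(i * angles)
    x_ = torch.view_as_complex(x.float().reshape(b, s, h, d // 2, 2))
    out = torch.view_as_real(x_ * rope.view(1, s, 1, d // 2))
    return out.reshape(b, s, h, d).type_as(x)

def rope_cos_sin(x, theta=1e4):
    # same rotation written out with explicit cos/sin on the real side
    b, s, h, d = x.shape
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))
    angles = torch.outer(torch.arange(s).float(), inv_freq)          # (s, d/2)
    cos = angles.cos().view(1, s, 1, -1)
    sin = angles.sin().view(1, s, 1, -1)
    x1, x2 = x[..., 0::2], x[..., 1::2]                              # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

For a random float32 tensor of shape (batch, seq, heads, even_dim), `torch.allclose(rope_complex(x), rope_cos_sin(x), atol=1e-6)` should hold; the residual difference is only rounding, which is what "not numerically equivalent" refers to.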

@vince62s
Contributor

our RoPE implementation is not strictly equivalent to the HF one: we rely on some of the "original" rope tricks based on complex-space computation (.polar/.real/.imag), whereas HF just applies .sin()/.cos(), which is not numerically equivalent

Not sure about that.
When the model comes from a Hugging Face format, rotary_interleave is False, and in that case we also use cos/sin (not the polar formula).
But maybe I am missing something.

@vince62s
Contributor

Did you check the speed performance of this refactoring? Recomputing/applying rotary is quite impactful. My only concern is the reassignment of cos/sin, which was previously performed only when shifting every 32 positions.

@francoishernandez
Member Author

francoishernandez commented Aug 28, 2024

our RoPE implementation is not strictly equivalent to the HF one: we rely on some of the "original" rope tricks based on complex-space computation (.polar/.real/.imag), whereas HF just applies .sin()/.cos(), which is not numerically equivalent

Not sure about that. When the model comes from a Hugging Face format, rotary_interleave is False, and in that case we also use cos/sin (not the polar formula). But maybe I am missing something.

The .polar trick I'm mentioning is in the initial rope computation:
[main]

rope = torch.polar(torch.ones_like(rope), rope)

[refactor]

rope = torch.polar(torch.ones_like(rope), rope)

Afterwards, both the "interleave" and "non-interleave" paths use the real/imag parts of the rope tensor to access cos/sin.
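
For reference, a tiny standalone check (not code from the PR; the dimensions are arbitrary) of why the polar form still yields cos/sin: torch.polar(ones, angles) builds exp(i * angles), so its real and imaginary parts are exactly the cos/sin tables the non-interleaved path reads.

```python
import torch

# Illustrative frequencies/angles for a head dim of 64 and 8 positions.
inv_freq = 1.0 / (10000 ** (torch.arange(0, 64, 2).float() / 64))
angles = torch.outer(torch.arange(8).float(), inv_freq)

rope = torch.polar(torch.ones_like(angles), angles)   # complex: exp(i * angles)
cos, sin = rope.real, rope.imag                       # same tables as angles.cos()/angles.sin()

print(torch.allclose(cos, angles.cos()), torch.allclose(sin, angles.sin()))  # expected: True True
```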

Did you check the speed performance of this refactoring? Recomputing/applying rotary is quite impactful. My only concern is the reassignment of cos/sin, which was previously performed only when shifting every 32 positions.

Not in depth. It might indeed be worth a look. We can probably keep some similar sort of cache to prevent unnecessary recomputation.
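
Something along these lines could work; a hypothetical sketch (none of these names exist in the codebase) of a cache that only recomputes the cos/sin tables when the requested length exceeds what was already precomputed:

```python
import torch

class RotaryCache:
    """Hypothetical rope cache: recompute the tables only when the requested
    length exceeds what has already been precomputed."""

    def __init__(self, dim, theta=1e4, chunk=512):
        self.inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.chunk = chunk
        self.cos = self.sin = None
        self.cached_len = 0

    def get(self, seqlen):
        if seqlen > self.cached_len:
            # round up to the next chunk so we do not recompute at every decoding step
            new_len = ((seqlen // self.chunk) + 1) * self.chunk
            angles = torch.outer(torch.arange(new_len).float(), self.inv_freq)
            rope = torch.polar(torch.ones_like(angles), angles)
            self.cos, self.sin = rope.real, rope.imag
            self.cached_len = new_len
        return self.cos[:seqlen], self.sin[:seqlen]
```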

@francoishernandez francoishernandez force-pushed the mha_refactor_rope_scaling branch from 2604c7c to 09a7b49 on August 28, 2024 13:16
@francoishernandez francoishernandez force-pushed the mha_refactor_rope_scaling branch from 09a7b49 to 3df5ef9 on August 28, 2024 13:22
@francoishernandez
Member Author

francoishernandez commented Aug 28, 2024

d79b6c3 -> similarly to the previous implementation, we pre-compute rope further than needed, to avoid recomputing it at every step.

Not sure about the exact impact, as my setup might not be the most stable, but it seems we had lost ~5-10% inference speed, which we regain here.
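
For illustration, with something like the hypothetical RotaryCache sketched above and a chunk of 512, incremental decoding would only trigger a recomputation every 512 positions rather than at each step:

```python
cache = RotaryCache(dim=128, chunk=512)
for step in range(1, 2049):
    cos, sin = cache.get(step)  # recomputes only at steps 1, 513, 1025, 1537
```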

@francoishernandez francoishernandez force-pushed the mha_refactor_rope_scaling branch from 3091a4e to d79b6c3 on August 28, 2024 14:52
@vince62s
Contributor

What is the benefit of setting position_embedding in transformer_decoder.py, transformer_lm_decoder.py, and transformer_encoder.py vs. directly in mha.py?

@francoishernandez
Member Author

What is the benefit of setting position_embedding in transformer_decoder.py, transformer_lm_decoder.py, and transformer_encoder.py vs. directly in mha.py?

That's a good question. It seemed cleaner to have a single "base" RotaryPosition object compute rope at a higher level and pass it to all underlying layers/MHA, but that's debatable (a rough sketch of the pattern follows).
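
To make the trade-off concrete, here is a rough, hypothetical sketch of that pattern (names, signatures, and the placeholder layer are illustrative, not the PR's actual classes): the decoder owns one RotaryPosition module, computes the rope table once per forward pass, and hands it down to every layer/MHA.

```python
import torch
import torch.nn as nn

class RotaryPosition(nn.Module):
    # hypothetical stand-in for a shared rope module owned by the encoder/decoder
    def __init__(self, head_dim, theta=1e4):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, seqlen):
        angles = torch.outer(
            torch.arange(seqlen, device=self.inv_freq.device).float(), self.inv_freq
        )
        return torch.polar(torch.ones_like(angles), angles)  # complex rope table

class DecoderLayer(nn.Module):
    # minimal placeholder: a real layer would hold an MHA that applies `rope`
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, rope):
        return self.proj(x)  # rope would be consumed inside MHA here

class TransformerDecoder(nn.Module):
    def __init__(self, num_layers, dim, heads):
        super().__init__()
        self.rope = RotaryPosition(dim // heads)  # single shared rope owner
        self.layers = nn.ModuleList([DecoderLayer(dim) for _ in range(num_layers)])

    def forward(self, x):
        rope = self.rope(x.size(1))   # computed once per forward...
        for layer in self.layers:
            x = layer(x, rope=rope)   # ...and passed down to each layer/MHA
        return x
```

The alternative, building the tables inside mha.py, keeps MHA self-contained but duplicates the computation per layer; sharing it at the encoder/decoder level computes it once per forward pass.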

@francoishernandez francoishernandez marked this pull request as ready for review August 29, 2024 13:28
@vince62s
Contributor

good to merge.

@francoishernandez francoishernandez merged commit b81cce1 into main Aug 30, 2024
4 checks passed
@francoishernandez francoishernandez deleted the mha_refactor_rope_scaling branch February 7, 2025 08:56