Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955
Conversation
You probably want to bump jllllll/exllama to 0.0.5 as a part of this commit as well. Released some hours ago: https://github.com/jllllll/exllama/releases/tag/0.0.5
Towards the end of the context the model can and does break down a bit, so fair warning I guess. Airoboros 65b 1.2 started having "vegetarian expressions on their face" and repeating itself. I capped it off at 3200; probably up to about 3500ish is good. For an unmodified model it's still a huge, huge win.
Up to ~3584 ctx seems to be absolutely fine in my tests. At about 3.6k and above, it starts to give issues. With alpha 4, results seem to be decent up to ~5500-5600 context. @pbasov thanks for the suggestion, updated the exllama wheel version in the requirements on the PR.
I did some comparison tests and found that this PR performs much better than the existing compress_pos_emb option. I used a 65b model and set the context length to 4096. Then, I tried using compress_pos_emb instead and got noticeably worse results. It seems that the compression approach is not a good fit for non-SuperHOT models.
They work differently. Compression is meant to be used with SuperHOT LoRAs/models by changing the compression factor of the position embeddings, while alpha (NTK RoPE scaling) is used to extend context on base models (for now) by changing the rotary embedding. Since, at the moment, there are no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily this method works pretty well up to 4k (~3600) tokens. There's no reason to remove compress for now.
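To make the distinction concrete, here is a minimal sketch (not exllama's actual code; function and argument names are illustrative) of where each knob intervenes in a standard RoPE implementation:

```python
import torch

def rope_angles(positions, head_dim, base=10000.0, compress=1.0, alpha=1.0):
    """Illustrative only: shows where each scaling method acts in RoPE.

    - compress (SuperHOT-style position interpolation): divides the position
      indices, squeezing a longer sequence into the trained position range.
    - alpha (static NTK RoPE scaling): raises the rotary base, stretching the
      low-frequency components so longer contexts stay in distribution.
    """
    # Static NTK scaling is commonly implemented as base * alpha^(d / (d - 2)).
    base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Position interpolation simply rescales the positions themselves.
    return torch.outer(positions.float() / compress, inv_freq)

# Example: rotation angles for 4096 positions with alpha=2 on 128-dim heads.
angles = rope_angles(torch.arange(4096), head_dim=128, alpha=2.0)
```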
@Panchovix You're right. I just realized that the SuperHOT models rely on the compress_pos_embedding option. By the way, the NTK RoPE embedding is a genius invention. Just changing one line of code for the embedding method gives perfect results.
If I want to achieve the best result, should I set the context length to 3500 or 4096 and simply disregard the tokens beyond 3500?
I've been testing some settings to see how the parameters impact the text generation process. To start off, take a look at the
Excellent addition!
Thank you @Panchovix! I can't say that I understand what this does, but the perplexity numbers look promising. I have just renamed the parameter to alpha_value.
So just to make sure, how do we use this?
Before you load the model, set alpha_value and increase the sequence length.
This adds support for the new NTK RoPE scaling, mentioned in turboderp/exllama#115.
After turboderp/exllama#118 got merged, exllama now supports NTK RoPE scaling.
This PR adds the parameter "alpha_emb", which is used to apply the alpha value for NTK RoPE scaling in the webui.
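For reference, a rough sketch of how the alpha value translates into the rotary base adjustment (the config object and its field names here are illustrative stand-ins, not exllama's exact API):

```python
from dataclasses import dataclass

@dataclass
class DummyConfig:
    # Stand-in for a loader config; field names are illustrative.
    head_dim: int = 128
    max_seq_len: int = 2048
    rotary_embedding_base: float = 10000.0

def apply_ntk_alpha(config: DummyConfig, alpha: float, max_seq_len: int) -> DummyConfig:
    """Static NTK RoPE scaling: raise the rotary base by alpha^(d / (d - 2))
    and extend the maximum sequence length the loader will accept."""
    config.max_seq_len = max_seq_len
    config.rotary_embedding_base *= alpha ** (config.head_dim / (config.head_dim - 2))
    return config

# Example: alpha 2 with a 4096-token window, as in the 65B tests below.
cfg = apply_ntk_alpha(DummyConfig(), alpha=2.0, max_seq_len=4096)
print(cfg.rotary_embedding_base)  # ~20222, i.e. 10000 * 2^(128/126)
```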
Addresses #2948
Tested on 65B models at 4K context, with 48GB VRAM (2x24) using gs 16,20, and on 33B models with 8K context.
Perplexity:
For tulu-30B-GPTQ (non-SuperHOT)
For Tulu-30B-SuperHOT-8K-4bit-32g:
NOTE: for context above 6K, I suggest continuing to use SuperHOT models.