Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955

Panchovix · 2023-07-01T19:45:08Z

This adds support for the new NTK RoPE scaling, mentioned in turboderp/exllama#115.

After turboderp/exllama#118 got merged, exllama now supports NTK RoPE scaling.

This PR Adds the parameter "alpha_emb" which is used to apply the alpha value for NTK RoPE scaling, on the webui.

Adresses #2948

Tested on 65B models at 4K context, with 48GB VRAM (2x24) using gs 16,20, and on 33B models with 8K context.

Perplexity:
For tulu-30B-GPTQ (non-SuperHOT)

Perplexity at 2048 ctx (no compress_pos_emb, no alpha RoPE): 5.2153
Perplexity at 8192 ctx, compress_pos_emb = 4: 10.0813
Perplexity at 8192 ctx, alpha = 4: 5.3534
Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 15.4406

For Tulu-30B-SuperHOT-8K-4bit-32g:

Perplexity at 8192 ctx, compress_pos_emb = 4: 5.8166
Perplexity at 8192 ctx, alpha = 4: 7.5073
Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 6.0903

NOTE: for context above 6K, I suggest to keep using SuperHOT models.

pbasov · 2023-07-02T10:25:11Z

You probably want to bump jllllll/exllama to 0.0.5 as a part of this commit as well.
https://github.com/oobabooga/text-generation-webui/blob/main/requirements.txt#L26-L27

Released some hours ago: https://github.com/jllllll/exllama/releases/tag/0.0.5

Ph0rk0z · 2023-07-02T10:55:15Z

Towards the end of the CTX the model can and does break down a bit. So fair warning I guess. Airoboros 65b 1.2 started having vegetarian expressions on their face and repeating itself. I capped it off at 3200. Probably till about 3500ish is good. For an unmodified model it's still a huge huge win.

Panchovix · 2023-07-02T23:11:36Z

Towards the end of the CTX the model can and does break down a bit. So fair warning I guess. Airoboros 65b 1.2 started having vegetarian expressions on their face and repeating itself. I capped it off at 3200. Probably till about 3500ish is good. For an unmodified model it's still a huge huge win.

Up to ~3584 ctx seems to be absolutely fine on my tests. About 3.6k and above, it starts to give issues.

With alpha 4, results seems to be decent until ~5500-5600 context.

@pbasov thanks for the suggestion, updated exllama wheel version on the requeriments on the PR.

shouyiwang · 2023-07-03T03:37:36Z

I did some comparison tests and found that this PR performs much better than the existing compress_pos_embedding method.

I used a 65b model and set the context length to 4096. Then, I tried using compress_pos_embedding and alpha_emb set to 2 respectively. The difference in the quality of the generated text was obvious.

It seems that the compress_pos_embedding method is redundant and better be removed later to avoid confusing users.

Panchovix · 2023-07-03T03:57:00Z

I did some comparison tests and found that this PR performs much better than the existing compress_pos_embedding method.

I used a 65b model and set the context length to 4096. Then, I tried using compress_pos_embedding and alpha_emb set to 2 respectively. The difference in the quality of the generated text was obvious.

It seems that the compress_pos_embedding method is redundant and better be removed later to avoid confusing users.

They work different. Compression is to be used with SuperHOT LoRA/models by changing the compression of the embedding factor, while alpha NTK RoPE scaling is used to extend context on base models (for now) by chaging the rotatory embedding.

Since at the moment, there's no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily this method works pretty good up to 4k (~3600) tokens.

There's no reason to remove compress for now.

shouyiwang · 2023-07-03T04:07:51Z

@Panchovix You're right. I just realized that the superHOT models rely on the compress_pos_embedding option.

By the way, the NTK RoPE embedding is a genius invention. Just changing one line of code for the embedding method gives perfect results.

shouyiwang · 2023-07-03T04:24:57Z

Since at the moment, there's no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily this method works pretty good up to 4k (~3600) tokens.

If I want to achieve the best result, should I set the context length to 3500 or 4096 and simply disregard the tokens beyond 3500?

gr1336 · 2023-07-03T15:44:20Z

Since at the moment, there's no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily this method works pretty good up to 4k (~3600) tokens.

If I want to achieve the best result, should I set the context length to 3500 or 4096 and simply disregard the tokens beyond 3500?

I've been testing some settings to see how the parameters impact the text generation process.

To start off, take a look at the `repetition_penalty_range` parameter:

This seems to have the biggest impact on the consistency of any model I've tested so far. Before this parameter, I was getting inconsistent text generation after the 3600~ tokens mark.
The values I saw in the results are a minimum of 384 up to 1024 tokens.
For me, the "sweet spot" in this case is 512, but I believe this will depend on the model you choose to use.
With that set to 512 I could get the 4096ctx with a very good consistency.
Superhot models are worse using alpha, I've been getting better results in the 4k context on the normal models.

Some other repetition settings that I recommend you test are the other "repetition penalties":

repetition_penalty : 1.15 through 1.20.
encoder_repetition_penalty : 1.0 to 1.025.
no_repeat_ngram_size : 0 [off] or something like 16.

Also play around with the other parameters, like `temperature`, `top-p` and `top-k`, my range is something like:

temperature : 0.4-1.0 - The higher the context, it should also be treated with a higher value, so for a 3000k+ context, a temperature of 0.7 to 0.8 might be the best option for writing "Precise".
top_k : 0 or 40 - I usually use only one of these values.
top_p : 0.5 to 0.95 - This is by far the hardest to find a good point for, since it changes so much based on context length.

I will test some more combinations of settings here, and different models, so that I can get more consistent information in a few days.

practical-dreamer

Excellent addition!

oobabooga · 2023-07-04T04:13:08Z

Thank you @Panchovix ! I can't say that I understand what this does, but the perplexity numbers look promising.

I have just renamed the parameter to alpha_value, as that's how it's called on exllama.

Barnplaid · 2023-07-05T12:28:22Z

So just to make sure, how do we use this?

Ph0rk0z · 2023-07-05T12:41:58Z

Before you load model, set alpha_value + increase seq length.

…ooga#2955)

Panchovix added 6 commits July 1, 2023 15:39

Add support for NTK RoPE scaling - server.py

6c60fdb

Add support for NTK RoPE scaling - exllama.py

be3a4cd

Add support for NTK RoPE scaling - exllama_hf.py

b436b52

Add Support for NTK RoPE scaling - loaders.py

e3e3cba

Add Support for NTK RoPE scaling - shared.py

d717a1d

Add Support for NTK RoPE scaling - ui.py

206cc18

Panchovix mentioned this pull request Jul 1, 2023

NTK-based exllama support potential #2948

Closed

Panchovix changed the title ~~Add Support for NTK RoPE scaling for exllama/exllama_hf~~ Add Support for Static NTK RoPE scaling for exllama/exllama_hf Jul 2, 2023

Panchovix added 3 commits July 2, 2023 14:13

Bumpe exllama wheel to 0.0.5

113212c

Merge branch 'oobabooga:main' into main

916832f

Update exllama wheel (fixed typo)

cae1e15

Panchovix added 2 commits July 3, 2023 23:17

Merge branch 'main' into main

2c8c612

Merge branch 'oobabooga:main' into main

9a7fdf7

practical-dreamer approved these changes Jul 4, 2023

View reviewed changes

oobabooga added 2 commits July 3, 2023 21:08

Rename the parameter

e75bd60

Spacing

c556936

oobabooga merged commit 10c8c19 into oobabooga:main Jul 4, 2023

jdehorty pushed a commit to jdehorty/text-generation-webui that referenced this pull request Jul 9, 2023

Add Support for Static NTK RoPE scaling for exllama/exllama_hf (oobab…

a8f4ef9

…ooga#2955)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955

Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955

Panchovix commented Jul 1, 2023 •

edited

Loading

pbasov commented Jul 2, 2023 •

edited

Loading

Ph0rk0z commented Jul 2, 2023

Panchovix commented Jul 2, 2023 •

edited

Loading

shouyiwang commented Jul 3, 2023 •

edited

Loading

Panchovix commented Jul 3, 2023

shouyiwang commented Jul 3, 2023

shouyiwang commented Jul 3, 2023

gr1336 commented Jul 3, 2023 •

edited

Loading

practical-dreamer left a comment

oobabooga commented Jul 4, 2023

Barnplaid commented Jul 5, 2023

Ph0rk0z commented Jul 5, 2023 •

edited

Loading

Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955

Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955

Conversation

Panchovix commented Jul 1, 2023 • edited Loading

pbasov commented Jul 2, 2023 • edited Loading

Ph0rk0z commented Jul 2, 2023

Panchovix commented Jul 2, 2023 • edited Loading

shouyiwang commented Jul 3, 2023 • edited Loading

Panchovix commented Jul 3, 2023

shouyiwang commented Jul 3, 2023

shouyiwang commented Jul 3, 2023

gr1336 commented Jul 3, 2023 • edited Loading

I've been testing some settings to see how the parameters impact the text generation process.

To start off, take a look at the repetition_penalty_range parameter:

Some other repetition settings that I recommend you test are the other "repetition penalties":

Also play around with the other parameters, like temperature, top-p and top-k, my range is something like:

I will test some more combinations of settings here, and different models, so that I can get more consistent information in a few days.

practical-dreamer left a comment

Choose a reason for hiding this comment

oobabooga commented Jul 4, 2023

Barnplaid commented Jul 5, 2023

Ph0rk0z commented Jul 5, 2023 • edited Loading

Panchovix commented Jul 1, 2023 •

edited

Loading

pbasov commented Jul 2, 2023 •

edited

Loading

Panchovix commented Jul 2, 2023 •

edited

Loading

shouyiwang commented Jul 3, 2023 •

edited

Loading

gr1336 commented Jul 3, 2023 •

edited

Loading

To start off, take a look at the `repetition_penalty_range` parameter:

Also play around with the other parameters, like `temperature`, `top-p` and `top-k`, my range is something like:

Ph0rk0z commented Jul 5, 2023 •

edited

Loading