Add Support for Static NTK RoPE scaling for exllama/exllama_hf #2955

Merged (13 commits) on Jul 4, 2023

Conversation

Panchovix (Contributor) commented Jul 1, 2023

This adds support for the new NTK RoPE scaling, mentioned in turboderp/exllama#115.

After turboderp/exllama#118 got merged, exllama now supports NTK RoPE scaling.

This PR adds the parameter "alpha_emb", which applies the alpha value for NTK RoPE scaling in the webui.

Addresses #2948

Tested on 65B models at 4K context with 48 GB of VRAM (2x24 GB, gpu_split 16,20), and on 33B models at 8K context.
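As a rough illustration of what the alpha value does, here is a minimal sketch of the static NTK base-scaling formula discussed in turboderp/exllama#115 (names are illustrative, not exllama's actual code):

```python
import torch

def ntk_scaled_inv_freq(head_dim: int, base: float = 10000.0, alpha: float = 1.0) -> torch.Tensor:
    # Static NTK scaling: instead of compressing position indices, stretch the
    # rotary base so low-frequency components cover a longer context while the
    # high-frequency (short-range) components stay nearly unchanged.
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# alpha = 4 matches the setting used for the 8192-ctx perplexity runs below.
inv_freq = ntk_scaled_inv_freq(head_dim=128, alpha=4.0)
```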

Perplexity:
For tulu-30B-GPTQ (non-SuperHOT):

  • Perplexity at 2048 ctx (no compress_pos_emb, no alpha RoPE): 5.2153
  • Perplexity at 8192 ctx, compress_pos_emb = 4: 10.0813
  • Perplexity at 8192 ctx, alpha = 4: 5.3534
  • Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 15.4406

For Tulu-30B-SuperHOT-8K-4bit-32g:

  • Perplexity at 8192 ctx, compress_pos_emb = 4: 5.8166
  • Perplexity at 8192 ctx, alpha = 4: 7.5073
  • Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 6.0903

NOTE: for contexts above 6K, I suggest sticking with SuperHOT models.

Panchovix changed the title from "Add Support for NTK RoPE scaling for exllama/exllama_hf" to "Add Support for Static NTK RoPE scaling for exllama/exllama_hf" on Jul 2, 2023
pbasov commented Jul 2, 2023

You probably want to bump jllllll/exllama to 0.0.5 as a part of this commit as well.
https://github.com/oobabooga/text-generation-webui/blob/main/requirements.txt#L26-L27

Released a few hours ago: https://github.com/jllllll/exllama/releases/tag/0.0.5

Ph0rk0z (Contributor) commented Jul 2, 2023

Towards the end of the context the model can and does break down a bit, so fair warning I guess. Airoboros 65B 1.2 started having vegetarian expressions on its face and repeating itself. I capped it off at 3200; probably up to about 3500 is fine. For an unmodified model it's still a huge, huge win.

Panchovix (Contributor, Author) commented Jul 2, 2023

> Towards the end of the context the model can and does break down a bit, so fair warning I guess. Airoboros 65B 1.2 started having vegetarian expressions on its face and repeating itself. I capped it off at 3200; probably up to about 3500 is fine. For an unmodified model it's still a huge, huge win.

Up to ~3584 ctx seems to be absolutely fine in my tests. At about 3.6K and above, it starts to give issues.

With alpha 4, results seem to be decent up to ~5500-5600 context.

@pbasov thanks for the suggestion, updated the exllama wheel version in the requirements of this PR.

shouyiwang (Contributor) commented Jul 3, 2023

I did some comparison tests and found that this PR performs much better than the existing compress_pos_emb method.

I used a 65B model and set the context length to 4096. Then I tried compress_pos_emb and alpha_emb set to 2, respectively. The difference in the quality of the generated text was obvious.

It seems that the compress_pos_emb method is redundant and should be removed later to avoid confusing users.

Panchovix (Contributor, Author)

> I did some comparison tests and found that this PR performs much better than the existing compress_pos_emb method.
>
> I used a 65B model and set the context length to 4096. Then I tried compress_pos_emb and alpha_emb set to 2, respectively. The difference in the quality of the generated text was obvious.
>
> It seems that the compress_pos_emb method is redundant and should be removed later to avoid confusing users.

They work differently. Compression is meant to be used with SuperHOT LoRAs/models, by changing the positional embedding compression factor (linear interpolation), while alpha NTK RoPE scaling is used to extend context on base models (for now) by changing the base of the rotary embedding.

Since, at the moment, there are no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily, this method works pretty well up to 4K (~3600) tokens.

There's no reason to remove compress for now.
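As a sketch of the distinction, assuming a standard RoPE implementation (names are illustrative, not the webui's or exllama's actual code):

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0,
                compress: float = 1.0, alpha: float = 1.0) -> torch.Tensor:
    # compress_pos_emb (linear interpolation): divide the position indices,
    # which is what SuperHOT LoRAs/models are fine-tuned to expect.
    positions = torch.arange(seq_len).float() / compress
    # alpha (static NTK): stretch the rotary base instead, so an unmodified
    # base model keeps its short-range resolution while gaining long range.
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)

angles_superhot = rope_angles(8192, 128, compress=4.0)  # SuperHOT-style model
angles_ntk = rope_angles(8192, 128, alpha=4.0)          # unmodified base model
```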

shouyiwang (Contributor)

@Panchovix You're right. I just realized that the SuperHOT models rely on the compress_pos_emb option.

By the way, NTK RoPE scaling is a genius invention. Just changing one line of code in the embedding method gives perfect results.

shouyiwang (Contributor)

> Since, at the moment, there are no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily, this method works pretty well up to 4K (~3600) tokens.

If I want to achieve the best result, should I set the context length to 3500 or 4096 and simply disregard the tokens beyond 3500?

gr1336 commented Jul 3, 2023

> Since, at the moment, there are no SuperHOT LoRAs for 65B models, static NTK RoPE scaling is the only way for that parameter count. Also, luckily, this method works pretty well up to 4K (~3600) tokens.
>
> If I want to achieve the best result, should I set the context length to 3500 or 4096 and simply disregard the tokens beyond 3500?

I've been testing some settings to see how the parameters impact the text generation process.

To start off, take a look at the repetition_penalty_range parameter:

  • This seems to have the biggest impact on the consistency of any model I've tested so far. Before setting this parameter, I was getting inconsistent text generation past the ~3600-token mark.
  • The values that gave good results range from a minimum of 384 up to 1024 tokens.
  • For me, the "sweet spot" is 512, but I believe this will depend on the model you use.
  • With it set to 512, I could reach 4096 ctx with very good consistency.
  • SuperHOT models do worse with alpha; I've been getting better results at 4K context with normal models.

Some other "repetition penalty" settings that I recommend you test:

  • repetition_penalty : 1.15 through 1.20.
  • encoder_repetition_penalty : 1.0 to 1.025.
  • no_repeat_ngram_size : 0 [off] or something like 16.

Also play around with the other parameters, like temperature, top_p, and top_k; my ranges are roughly as follows (a consolidated sketch of these values appears below):

  • temperature : 0.4-1.0 - The longer the context, the higher the value should be; for 3000+ tokens of context, a temperature of 0.7 to 0.8 might be the best option for precise writing.
  • top_k : 0 or 40 - I usually use only one of these values.
  • top_p : 0.5 to 0.95 - This is by far the hardest to find a good point for, since it changes so much based on context length.

I will test some more combinations of settings here, and different models, so that I can get more consistent information in a few days.
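Pulling the ranges above together, a starting-point preset could look like the sketch below (standard HF-style sampler keys; the specific values are just illustrative midpoints of the listed ranges, not tuned results):

```python
# Illustrative preset collecting the ranges above; values are starting points,
# not tuned results, and should be retuned per model and context length.
long_context_preset = {
    "temperature": 0.7,                 # 0.4-1.0; lean higher for longer contexts
    "top_k": 40,                        # or 0 to disable
    "top_p": 0.9,                       # 0.5-0.95, retune per context length
    "repetition_penalty": 1.18,         # 1.15-1.20
    "encoder_repetition_penalty": 1.0,  # up to ~1.025
    "no_repeat_ngram_size": 0,          # 0 (off) or something like 16
    "repetition_penalty_range": 512,    # 384-1024; biggest consistency win past ~3600 tokens
}
```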

practical-dreamer (Contributor) left a comment


Excellent addition!

oobabooga (Owner)

Thank you @Panchovix! I can't say that I understand what this does, but the perplexity numbers look promising.

I have just renamed the parameter to alpha_value, as that's what it's called in exllama.

oobabooga merged commit 10c8c19 into oobabooga:main on Jul 4, 2023
Barnplaid

So just to make sure, how do we use this?

Ph0rk0z (Contributor) commented Jul 5, 2023

Before you load the model, set alpha_value and increase the max sequence length.

jdehorty pushed a commit to jdehorty/text-generation-webui that referenced this pull request Jul 9, 2023