
Modified RoPE with linear scaling #2019

Closed · wants to merge 2 commits

Conversation

@ikawrakow (Contributor)

This PR adds the ability to use context sizes greater than the training context size by applying linear scaling to the token position before applying the RoPE operation. The idea originally came from kaiokendev (https://github.com/kaiokendev). See also the discussion in #1965. This can be used out of the box for context sizes up to, say, 3072. For even larger contexts one is most likely better off fine-tuning the model, as discussed in #1965 and elsewhere around the Internet (but perhaps fine-tuning starting from this PR would give better results?).

PR #1967 is similar to this PR. The main difference is how the linear scaling is applied. In PR #1967 the positional scaling factor is a constant defined at compile time and therefore also affects evaluation when the context size being used is less than the maximum training context size. This leads to much higher perplexities for context sizes up to 2048 (the training context size of the LLaMA models); e.g., we get ppl = 7 at a context size of 512 instead of the ppl = 5.9066 obtained without scaling. In this PR we define a compile-time constant (LLAMA_TRAINIG_CTX, which in turn defines GGML_TRAINING_CTX) that is the training context size of the model being used. Further, we pass the current context size n_ctx to the RoPE operation when constructing the computational graph. When the graph is evaluated, we use the actual token position when n_ctx <= GGML_TRAINING_CTX, but scale the position with GGML_TRAINING_CTX / n_ctx when n_ctx > GGML_TRAINING_CTX.

Interestingly enough, it is better to set GGML_TRAINING_CTX to 2176 (2048 + 128) and not 2048 (better in the sense that we get slightly lower perplexities).
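
For illustration, here is a minimal sketch of the scaling rule described above. This is not the actual patch; the helper name `rope_position` is made up for this example, and GGML_TRAINING_CTX stands for the compile-time training context (2176 by default in this PR):

```cpp
#ifndef GGML_TRAINING_CTX
#define GGML_TRAINING_CTX 2176   // training context size; this PR's default is 2176
#endif

// Hypothetical helper illustrating the rule: positions are untouched within the
// training window, and compressed linearly by GGML_TRAINING_CTX / n_ctx beyond it.
static float rope_position(int p0, int n_ctx) {
    if (n_ctx <= GGML_TRAINING_CTX) {
        return (float) p0;                                            // identical to plain RoPE
    }
    return (float) p0 * ((float) GGML_TRAINING_CTX / (float) n_ctx);  // linear scaling
}
```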

The following table gives a summary of perplexities obtained with the 7B LLaMA model for context sizes up to 5120. It is currently not possible to run with n_ctx > 5120 for 7B on CUDA (we get NaNs, and @JohannesGaessler is looking into the issue). Q6_K quantization is used, which is known to match fp16 perplexity to within 0.1%.

| n_ctx | Perplexity |
|------:|-----------:|
| 1024  | 5.4351     |
| 2048  | 5.2855     |
| 2560  | 5.2232     |
| 3072  | 5.3003     |
| 3584  | 5.4725     |
| 4096  | 5.6743     |
| 4608  | 5.8487     |
| 5120  | 6.1153     |

It is interesting to note that we actually outperform the n_ctx = 2048 result at n_ctx = 2560, i.e., we do get some "free lunch" :-)

The next table gives the results for 13B, where we can currently only go up to about n_ctx = 3540 on CUDA before getting NaNs:

| n_ctx | Perplexity |
|------:|-----------:|
| 2048  | 4.7094     |
| 2560  | 4.6459     |
| 3072  | 4.6868     |
| 3584  | 4.8135     |

Here the n_ctx = 2048 result is outperformed up to 3072 tokens.

For n_ctx <= 3072, linear scaling works best. For n_ctx > 3072, one can get slightly better results by using, e.g.,

p = p0 * a_scale / (1 + p0 / b_scale)

where p0 is the original token position and p is the token position given to RoPE. The constants a_scale and b_scale that seem to work best are given by

a_scale = sqrtf(GGML_TRAINING_CTX / n_ctx)
b_scale = a_scale * n_ctx / (1 - a_scale)

(with this being applied only when n_ctx > GGML_TRAINING_CTX). Using this approach I get ppl = 5.6472 for n_ctx = 4096 with 7B. Since the gain is quite modest compared to plain linear scaling and the approach is more complicated, I did not add it to the PR.
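
For reference, a sketch of this alternative (the helper name is invented for the example; as noted, it would only be applied when n_ctx > GGML_TRAINING_CTX):

```cpp
#include <cmath>

#ifndef GGML_TRAINING_CTX
#define GGML_TRAINING_CTX 2176
#endif

// Non-linear variant described above; only meaningful for n_ctx > GGML_TRAINING_CTX,
// where a_scale < 1 and hence b_scale > 0.
static float rope_position_nonlinear(int p0, int n_ctx) {
    const float a_scale = sqrtf((float) GGML_TRAINING_CTX / (float) n_ctx);
    const float b_scale = a_scale * (float) n_ctx / (1.0f - a_scale);
    return (float) p0 * a_scale / (1.0f + (float) p0 / b_scale);
}
```

With GGML_TRAINING_CTX = 2176 and n_ctx = 4096 this gives a_scale ≈ 0.73, so early positions are compressed less aggressively than with the plain linear factor of ≈ 0.53, while the largest position still maps back into the trained range (plugging p0 = n_ctx into the formula gives exactly n_ctx · a_scale² = GGML_TRAINING_CTX).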

When the context size is greater than the maximum context size
during training, scale the position given to RoPE with
training context / n_ctx.
@ikawrakow requested a review from ggerganov on June 27, 2023 at 13:20
@FNsi (Contributor) commented Jun 27, 2023

Did you try with a higher context in a larger model (for example, 16k with 13B)?

@ikawrakow (Contributor Author) commented Jun 27, 2023

> Did you try with higher context in 13b like 16384?

No, because, as mentioned in the PR, the CUDA implementation in llama.cpp currently does not work beyond ~3.5k context for 13B, and I don't have the patience to run it on the CPU. We will try to refine once the CUDA issue has been fixed.

@slaren (Member) commented Jun 27, 2023

If you are looking for numerical stability issues in the CUDA implementation, I would suggest looking at the soft max kernel.

@ikawrakow (Contributor Author)

> If you are looking for numerical stability issues in the CUDA implementation, I would suggest looking at the soft max kernel.

It is not a matter of numerical stability. It is a matter of the buffer sizes not being estimated correctly for large context sizes, so data in the scratch buffers gets overwritten (or not written at the right place). The incorrect scratch buffer sizes also affect the CPU code; it is just that this happens somewhat later (at larger context sizes) and llama.cpp stops with an assert instead of producing NaNs. E.g., if I run a 7B model with

./bin/perplexity -m q6k.bin -f ../tests/wikitext-2-raw/wiki.test.raw -s 1234 -t 16 -c 8192

I get

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 545259520, available 536870912)

When the code was developed, context sizes greater than 2048 weren't a possibility, so to some extent it is not surprising that it does not work when the assumption n_ctx <= 2048 no longer holds.
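
For reference, the numbers in that assert line up with the fixed 512 MiB scratch0 size used for 7B (the same value that appears in the diff further down); a quick check, assuming nothing beyond the reported figures:

```cpp
#include <cstdio>

int main() {
    const unsigned long long MB        = 1024ULL * 1024ULL;
    const unsigned long long available = 512ULL * MB;   // scratch0 size for MODEL_7B at the time
    const unsigned long long needed    = 545259520ULL;  // from the assert above (-c 8192)

    // prints: needed 520 MiB vs available 512 MiB -> the pool is 8 MiB short
    printf("needed %llu MiB vs available %llu MiB\n", needed / MB, available / MB);
    return 0;
}
```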

@slaren (Member) commented Jun 27, 2023

That will be easier to fix then. I know that the soft max kernel produces infs and NaNs in F16, so it seemed to be the same issue. The way soft max is currently implemented is prone to numerical stability issues and should be fixed eventually.

@FNsi (Contributor) commented Jun 27, 2023

I built #1967 with OpenBLAS and got garbage results with a fine-tuned 16k model.

Might this give you some ideas for the CUDA version?

@jxy (Contributor) commented Jun 27, 2023

The calculation of the scratch buffer size is unclear. The following works for an 8k context.

diff --git a/llama.cpp b/llama.cpp
index 1a15844..eb112a2 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -83,9 +83,9 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
     static std::map<e_model, size_t> k_sizes = {
         { MODEL_3B,    256ull * MB },
         { MODEL_7B,    512ull * MB },
-        { MODEL_13B,   512ull * MB },
-        { MODEL_30B,   512ull * MB },
-        { MODEL_65B,  1024ull * MB },
+        { MODEL_13B,  1024ull * MB },
+        { MODEL_30B,  1024ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return k_sizes;
 }
@@ -715,7 +715,7 @@ struct llama_model_loader {
         *ctx_size_p = *mmapped_size_p = 0;
         for (const llama_load_tensor & lt : tensors_map.tensors) {
             *ctx_size_p += sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE;
-            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size;
+            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size + 16;
         }
     }
 

@SlyEcho (Collaborator) commented Jun 27, 2023

What is the effect on models other than LLaMA?

@ikawrakow (Contributor Author)

> What is the effect on models other than LLaMA?

If your usage stays within their respective training context sizes, as it has been until now: none. Otherwise, you try and see what happens.

@SlyEcho added labels: enhancement (New feature or request), generation quality (Quality of model output) on Jun 27, 2023
@Midaychi

It's possible that if a base LLaMA or OpenLLaMA model were completely fine-tuned using this RoPE method (rather than just with the SuperHOT LoRA), it might achieve better overall results and could possibly have no context ceiling beyond the realistic limitations of memory.

@@ -72,6 +72,7 @@ set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kern
set(LLAMA_CUDA_DMMV_Y "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
option(LLAMA_CUDA_DMMV_F16 "llama: use 16 bit floats for dmmv CUDA kernels" OFF)
set(LLAMA_CUDA_KQUANTS_ITER "2" CACHE STRING "llama: iters./thread per block for Q2_K/Q6_K")
set(LLAMA_TRAINIG_CTX "2176" CACHE STRING "llama: model training maximum context")
Review comment (Collaborator):

Should this be LLAMA_TRAINING_CTX?

@KerfuffleV2 (Collaborator)

@ikawrakow What do you think of this idea? #1965 (comment)

Pass in an array of specific scales with an entry for each possible context length. It could even use the existing options subtensor for storage.

This would allow trying out different approaches to scaling. Other models that use RoPE might also benefit from this sort of thing in the future, but require a different scale than LLaMA.
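
One possible reading of that idea, sketched with made-up names (how the table would actually be stored, e.g. in the opt tensor, is left open):

```cpp
#include <vector>

// Hypothetical: precompute one positional scale per candidate context length so that
// any generator (linear, non-linear, NTK-aware, model-specific, ...) can fill the table.
std::vector<float> make_rope_scale_table(int n_train_ctx, int max_ctx, int step = 512) {
    std::vector<float> scales;
    for (int n_ctx = step; n_ctx <= max_ctx; n_ctx += step) {
        // linear scaling used here purely as the example generator
        scales.push_back(n_ctx <= n_train_ctx ? 1.0f : (float) n_train_ctx / (float) n_ctx);
    }
    return scales; // handed to the RoPE op instead of a compile-time constant
}
```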

@ikawrakow (Contributor Author)

Sounds interesting. But isn't it better to first merge this and then try to refine? Btw, I have gone on vacation and will not be very responsive in the next 2 weeks.

@JianbangZ

Once this PR is merged, how do I use it? Do I need to requantize the model, or just rebuild llama.cpp and set n_ctx on the fly?

@ggerganov (Member)

I don't think we need the new defines. We should pass all parameters from the user code through the opt tensor of ggml_rope(). We can also utilize the mode parameter to support both the original interpolation method from #1967 and the one proposed here (and potentially new ones proposed in the future)

@ikawrakow (Contributor Author)

@ggerganov The main value of this PR is not the 20-30 lines code change, which was a matter of 5 minutes of effort. The PR shows how to correctly apply position scaling, which was missed by everybody, yourself included. Finding how to do it better compared to what was out there was where the 2-3 days effort went. Once this was done, it was clear that one needs the model training context size, which is a model hyper parameter that has been completely missed so far. Instead of embarking on a much bigger change to add this new hyper parameter to ggml/llama.cpp, I took the pragmatic approach of defining it via a preprocessor macro. This seemed appropriate, considering that the number of training context windows of models using RoPE and supported by llama.cpp is exactly one.

Concerning using mode: yes, I considered that, but decided not to add a new mode. Why? Very simple: what is the point of keeping mode 0 unmodified and adding a new mode, given that a) mode 0 is identical to the new mode for contexts less than or equal to the training context size, and b) mode 0 completely falls apart for contexts greater than the training context window? Basically, you can see this PR as fixing mode 0 by letting it extend in a reasonable way where it failed completely before, while keeping its behavior unmodified where the original version works.

@ikawrakow (Contributor Author)

> Once this PR is merged, how do I use it? Do I need to requantize the model, or just rebuild llama.cpp and set n_ctx on the fly?

No, no need to requantize. Just specify the context window to use via -c on the command line.

@Midaychi

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
This was posted not too long ago; is this helpful?

@ikawrakow closed this Jun 29, 2023
@LostRuins (Collaborator)

Why close this PR?

@TheBloke (Contributor) commented Jun 29, 2023

I was just about to start uploading dozens of SuperHOT GGML repos.

KoboldCpp just added support for RoPE, based on this PR, so I was going to start uploading on that basis. But my READMEs were also going to link to this PR and tell users the feature was coming soon to llama.cpp.

It'd be great to know what's happening with this PR and why it's been closed?

@Nexesenex (Contributor)

Well, it works or it doesn't.
Upload at least one model, Tom, please, so we can test it on the latest KoboldCpp!

@TheBloke (Contributor)

Don't worry, I will.

But I'd like to know what to tell people about upcoming llama.cpp support. I thought it was imminent, but if it's not, I'd like to tell people that it'll only work in KoboldCpp until further notice.

@maddes8cht (Contributor) commented Jun 29, 2023

Ouch, this is sad.

Will there be a new improved PR?
What's going on?

Some bad communication?

I've been watching this thread since its creation and was eagerly waiting for it to be merged.

@ikawrakow (Contributor Author)

In light of today's XGen-7B announcement, having the training context length as a compile time option is not a good approach. That's why I closed this PR. A better approach will hopefully come soon.

@TheBloke (Contributor)

Ah OK.

And you couldn't merge this now and then improve the implementation as a separate PR?

That would allow people to try the SuperHOT GGMLs now, which many people would love to do. It's great that KoboldCpp supports it, but llama.cpp support will then also bring llama-cpp-python and ctransformers, opening it up to several more UIs, and Python code.

PS. Am I right in assuming that XGen GGML will take a bit of work, given it requires a different tokeniser?

@LostRuins (Collaborator)

@ikawrakow how about, instead of a compile-time constant, assuming that the training ctxlen is equal to the value in the config's max_position_embeddings, which GGML can read? Is that a valid assumption?

- XGen-7B has max_position_embeddings = 8192
- A typical LLaMA model has max_position_embeddings = 2048
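
If that assumption holds, the compile-time constant could indeed become a per-model value picked up at load time; a rough sketch with illustrative names (not the actual llama.cpp hparams struct):

```cpp
// Sketch only: assume the converter stores max_position_embeddings in the GGML file
// as a hyperparameter, called n_ctx_train here for the example.
struct example_hparams {
    int n_ctx_train; // 2048 for a typical LLaMA model, 8192 for XGen-7B
};

static float rope_scale_for(const example_hparams & hp, int n_ctx) {
    // no scaling while the requested context stays within the trained window
    return n_ctx <= hp.n_ctx_train ? 1.0f : (float) hp.n_ctx_train / (float) n_ctx;
}
```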

@maddes8cht (Contributor) commented Jun 29, 2023

> In light of today's XGen-7B announcement, having the training context length as a compile time option is not a good approach. That's why I closed this PR. A better approach will hopefully come soon.

XGen-7B is a completely different model.
This doesn't make any sense at all.
It's like proposing we don't need LLaMA anymore because of XGen.
We didn't say that with the release of Falcon, so why now?

LLaMA and Falcon may become comparable with XGen (or not), and maybe this approach will even help tear down XGen's 8k limit one day.

Saying we don't need this PR is somehow like giving up on LLaMA because of XGen.

@LostRuins (Collaborator)

@maddes8cht what ikawrakow says has some merit. By forcing RoPE to be scaled, you potentially lose quality on larger models that can handle longer contexts, since everything gets forcibly compressed back down to a 2k window, so it is not an ideal general-purpose solution.

@KerfuffleV2 (Collaborator)

Haha, maybe I need to reopen #1967. I closed it in favor of this one. Shouldn't be too hard to adapt it to the pregenerated scale approach and add a few commandline options. (I'll look into that tomorrow if I get a chance.)

@ggerganov I just want to make sure I understood what you said before: If we were to change RoPE to take a scale array + scale array length options you'd want to just change ggml_rope and ggml_rope_inplace to take new arguments instead of creating a new ggml_rope_with_scale or something - even though this change would break all applications that use GGML?

@maddes8cht (Contributor)

@LostRuins Since LLaMA DOES have only a 2k trained context, it's still like giving up on LLaMA altogether.

Quoting @ikawrakow himself:

> The main value of this PR is not the 20-30 lines code change, which was a matter of 5 minutes of effort. The PR shows how to correctly apply position scaling, which was missed by everybody, yourself included. Finding how to do it better compared to what was out there was where the 2-3 days effort went. Once this was done, it was clear that one needs the model training context size, which is a model hyper parameter that has been completely missed so far. Instead of embarking on a much bigger change to add this new hyper parameter to ggml/llama.cpp, I took the pragmatic approach of defining it via a preprocessor macro. This seemed appropriate, considering that the number of training context windows of models using RoPE and supported by llama.cpp is exactly one.

Isn't this about applying scaled RoPE to LLaMA first, maybe to Falcon next, and then, maybe, at some point adapting it so it can be handled flexibly in ggml, so that ggml is more easily adaptable to different models and can also cover something like XGen, instead of having a separate XGen.cpp project just as we have for Falcon right now?

@SlyEcho (Collaborator) commented Jun 29, 2023

What if it were a command line argument so the user could decide how to use the model, just like n_ctx is today?

@KerfuffleV2 (Collaborator)

> What if it were a command line argument so the user could decide how to use the model, just like n_ctx is today?

The thing is, it's not just a single number you can plug in. People have been playing with different functions for scaling it. So I think something like what I mentioned, where you pass in the scale for each context length, is the most flexible (any algorithm could generate it). Another way might be to pass in a function pointer, kind of like the map functions.

From a user perspective, you'd probably need to have a couple of built-in algorithms that could support different approaches to scaling (and it's possible different algorithms would be needed to scale other models like Falcon, etc.). Then the user could select between them and possibly specify extra values if necessary, kind of like how choosing/configuring samplers works.
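
As a hedged sketch of what such built-in algorithms might look like if selectable by the user (the enum and function names are invented for this example):

```cpp
#include <cmath>

enum rope_scaling_type { ROPE_SCALING_NONE, ROPE_SCALING_LINEAR, ROPE_SCALING_NONLINEAR };

// Dispatch between a couple of scaling approaches; extra per-algorithm constants could
// be added the same way sampler options are configured today.
static float rope_scaled_position(int p0, int n_ctx, int n_train_ctx, rope_scaling_type type) {
    if (type == ROPE_SCALING_NONE || n_ctx <= n_train_ctx) {
        return (float) p0;
    }
    if (type == ROPE_SCALING_LINEAR) {
        return (float) p0 * (float) n_train_ctx / (float) n_ctx;
    }
    // ROPE_SCALING_NONLINEAR: the variant discussed earlier in the thread
    const float a = sqrtf((float) n_train_ctx / (float) n_ctx);
    const float b = a * (float) n_ctx / (1.0f - a);
    return (float) p0 * a / (1.0f + (float) p0 / b);
}
```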

@JohannesGaessler (Collaborator)

I investigated minimum VRAM scratch requirements for 7b. It seems that the scratch buffer will need to increase linearly with context size:

| Model   | Context size | Min. VRAM scratch size [MiB] | Delta [MiB] |
|---------|-------------:|-----------------------------:|------------:|
| 7b q6_K | 512  | 136 | -  |
| 7b q6_K | 1024 | 171 | 35 |
| 7b q6_K | 1536 | 243 | 72 |
| 7b q6_K | 2048 | 318 | 75 |
| 7b q6_K | 2560 | 350 | 32 |
| 7b q6_K | 3072 | 382 | 32 |
| 7b q6_K | 3584 | 414 | 32 |
| 7b q6_K | 4096 | 446 | 32 |
| 7b q6_K | 4608 | 478 | 32 |
| 7b q6_K | 5120 | 510 | 32 |
| 7b q6_K | 5632 | 542 | 32 |
| 7b q6_K | 6144 | 574 | 32 |
| 7b q6_K | 6656 | 606 | 32 |
| 7b q6_K | 7168 | 638 | 32 |
| 7b q6_K | 7680 | 670 | 32 |
| 7b q6_K | 8192 | 702 | 32 |
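
From these measurements, a rough rule of thumb for the 7B q6_K case is 318 MiB at n_ctx = 2048 plus about 32 MiB per additional 512 tokens; a linear extrapolation of the table above, not an exact formula:

```cpp
// Rough extrapolation of the 7b q6_K measurements above.
static int estimate_scratch_mib_7b(int n_ctx) {
    if (n_ctx <= 2048) {
        return 318;                          // measured value at 2048 used as an upper bound
    }
    return 318 + 32 * (n_ctx - 2048) / 512;  // +32 MiB per extra 512 tokens
}
// e.g. estimate_scratch_mib_7b(8192) == 318 + 32*12 == 702, matching the last row.
```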

@ikawrakow (Contributor Author)

I'm on vacation, did not take my laptop with me, and it is a bit tedious to answer on the phone. When I closed the PR earlier, I was expecting to get to type a comment before it was closed. This shows my age, having such unreasonable expectations of modern UI/UX. Anyhow, unless my brain has melted in the Mediterranean sun and I did not understand the description as a result, XGen-7B is a drop-in replacement for LLaMA. It offers models trained with 4k and 8k contexts, so the PR is not good for that. Several alternatives were offered by people commenting, but each requires completely modifying this PR. If I take a command line option as an example, one needs to modify the signatures of ggml_rope and ggml_rope_inplace to pass the parameter, and then one needs to pass it to the actual compute kernel via opt, as @ggerganov suggested earlier. So, basically a very different PR.

It is an open source repo, so anyone can reopen and modify accordingly. I'm not doing it in the next 10 days while on vacation without my laptop.

@JohannesGaessler (Collaborator)

13b numbers:

| Model    | Context size | Min. VRAM scratch size [MiB] | Delta [MiB] |
|----------|-------------:|-----------------------------:|------------:|
| 13b q4_0 | 512  | 170 | -   |
| 13b q4_0 | 1024 | 214 | 44  |
| 13b q4_0 | 1536 | 251 | 37  |
| 13b q4_0 | 2048 | 398 | 147 |
| 13b q4_0 | 2560 | 438 | 40  |
| 13b q4_0 | 3072 | 478 | 40  |
| 13b q4_0 | 3584 | 518 | 40  |
| 13b q4_0 | 4096 | 558 | 40  |
| 13b q4_0 | 4608 | 598 | 40  |
| 13b q4_0 | 5120 | 638 | 40  |
| 13b q4_0 | 5632 | 678 | 40  |
| 13b q4_0 | 6144 | 718 | 40  |
| 13b q4_0 | 6656 | 758 | 40  |
| 13b q4_0 | 7168 | 798 | 40  |
| 13b q4_0 | 7680 | 838 | 40  |
| 13b q4_0 | 8192 | 878 | 40  |
