Modified RoPE with linear scaling #2019
Conversation
When the context size is greater than the maximum context size during training, scale the position given to RoPE with training context / n_ctx.
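As a rough illustration of that rule, here is a small sketch (not the actual ggml code; training_ctx and n_ctx are just illustrative names for the model's training context and the requested context):

```cpp
// Sketch of the linear position scaling proposed in this PR (illustrative only).
// When the requested context n_ctx exceeds the training context, every token
// position is compressed by training_ctx / n_ctx so that positions stay inside
// the range the model saw during training.
static float scaled_position(int p0, int n_ctx, int training_ctx) {
    if (n_ctx <= training_ctx) {
        return (float) p0;  // within the training range: use the raw position
    }
    // e.g. p0 = 4095, n_ctx = 4096, training_ctx = 2048 -> ~2047.5
    return (float) p0 * (float) training_ctx / (float) n_ctx;
}
```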
Did you try with a higher context on a larger model (for example, 16k on 13B)?
No, because, as mentioned in the PR, the CUDA implementation currently does not work at such large context sizes.
If you are looking for numerical stability issues in the CUDA implementation, I would suggest looking at the soft max kernel.
It is not a matter of numerical stability. It is a matter of buffer sizes not being estimated correctly for large context sizes, and stuff in the scratch buffers being overwritten (or not being written at the right place). The size of the scratch buffers being incorrect also affects the CPU code; it is just that this happens somewhat later (at larger context sizes), and I get garbage results there as well. When the code was developed, context sizes greater than 2048 weren't a possibility, so to some extent it is not surprising that it does not work when that assumption no longer holds.
That will be easier to fix, then. I know that the soft max kernel produces infs and NaNs in F16, so it seemed to be the same issue. The way soft max is currently implemented is prone to numerical stability issues and should be fixed eventually.
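For reference, the usual remedy for inf/NaN overflow in soft max is to subtract the row maximum before exponentiating. A minimal CPU-side sketch of that trick (illustrative only, not the ggml or CUDA kernel) looks like this:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax sketch: subtracting the row maximum keeps the
// largest exponent at 0, so exp() cannot overflow (important in F16, where
// exp(x) already overflows for x around 11).
static void softmax_stable(std::vector<float> & row) {
    const float max_val = *std::max_element(row.begin(), row.end());

    float sum = 0.0f;
    for (float & v : row) {
        v = std::exp(v - max_val);  // largest argument is 0 -> no overflow
        sum += v;
    }
    for (float & v : row) {
        v /= sum;
    }
}
```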
I built #1967 with OpenBLAS and got garbage results with a fine-tuned 16k model. Might this give you some ideas for the CUDA version?
The calculation of scratch buffer size is unclear. The following works for 8k context.

diff --git a/llama.cpp b/llama.cpp
index 1a15844..eb112a2 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -83,9 +83,9 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
static std::map<e_model, size_t> k_sizes = {
{ MODEL_3B, 256ull * MB },
{ MODEL_7B, 512ull * MB },
- { MODEL_13B, 512ull * MB },
- { MODEL_30B, 512ull * MB },
- { MODEL_65B, 1024ull * MB },
+ { MODEL_13B, 1024ull * MB },
+ { MODEL_30B, 1024ull * MB },
+ { MODEL_65B, 2048ull * MB },
};
return k_sizes;
}
@@ -715,7 +715,7 @@ struct llama_model_loader {
*ctx_size_p = *mmapped_size_p = 0;
for (const llama_load_tensor & lt : tensors_map.tensors) {
*ctx_size_p += sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE;
- *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size;
+ *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size + 16;
}
}
What is the effect on models other than LLaMA?
If your use is within their respective training context size, as it has been until now, none. Otherwise, you try and see what happens.
It's possible that if a base LLaMA or OpenLLaMA model were completely fine-tuned using this RoPE method (rather than just with the SuperHOT LoRA), it might achieve better overall results and could possibly have no context ceiling beyond the realistic limitations of memory.
@@ -72,6 +72,7 @@ set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kern
 set(LLAMA_CUDA_DMMV_Y "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
 option(LLAMA_CUDA_DMMV_F16 "llama: use 16 bit floats for dmmv CUDA kernels" OFF)
 set(LLAMA_CUDA_KQUANTS_ITER "2" CACHE STRING "llama: iters./thread per block for Q2_K/Q6_K")
+set(LLAMA_TRAINIG_CTX "2176" CACHE STRING "llama: model training maximum context")
Should this be LLAMA_TRAINING_CTX?
@ikawrakow What do you think of this idea? #1965 (comment) Pass in an array of specific scales with an entry for each possible context length. It could even use the existing options subtensor for storage. This would allow trying out different approaches for scales; also, other models that use RoPE in the future might benefit from this sort of thing but require a different scale than LLaMA.
Sounds interesting. But isn't it better to first merge this and then try to refine? Btw, I have gone on vacation and will not be very responsive in the next 2 weeks.
Once this PR is merged, how do I use it? Do I need to requantize the model, or just rebuild llama.cpp and set n_ctx on the fly?
I don't think we need the new defines. We should pass all parameters from the user code through the API instead of baking them in at compile time.
@ggerganov The main value of this PR is not the 20-30 line code change, which was a matter of 5 minutes of effort. The PR shows how to correctly apply position scaling, which was missed by everybody, yourself included. Finding how to do it better compared to what was out there was where the 2-3 days of effort went. Once this was done, it was clear that one needs the model training context size, which is a model hyper parameter that has been completely missed so far. Instead of embarking on a much bigger change to add this new hyper parameter properly, I went for a compile time constant. Concerning using …
No, no need to requantize. Just specify the context window to use via the -c command line option (n_ctx).
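For example, assuming a build of this branch, a run with a 4k window would look something like the following (model path, quantization, and prompt are placeholders; -m and -c are the existing model and context-size options of the main example):

```
./main -m models/7B/ggml-model-q6_k.bin -c 4096 -p "Your prompt here"
```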
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
Why close this PR?
I was just about to start uploading dozens of SuperHOT GGML repos. KoboldCpp just added support for RoPE scaling based on this PR, so I was going to start uploading on that basis. But my READMEs were also going to link to this PR and tell users the feature was coming soon to llama.cpp. It'd be great to know what's happening with this PR and why it's been closed.
Well, it works or it doesn't.
Don't worry, I will. But I'd like to know what to tell people about upcoming llama.cpp support. I thought it was imminent but if it's not I'd like to tell people that it'll only work in KoboldCpp until further notice.
Ouch, this is sad. Will there be a new, improved PR? Some bad communication? I've been watching this thread since its creation and was eagerly waiting for it to be merged.
In light of today's XGen-7B announcement, having the training context length as a compile time option is not a good approach. That's why I closed this PR. A better approach will hopefully come soon.
Ah OK. And you couldn't merge this now and then improve the implementation as a separate PR? That would allow people to try the SuperHOT GGMLs now, which many people would love to do. It's great that KoboldCpp supports it, but llama.cpp support will then also bring llama-cpp-python and ctransformers, opening it up to several more UIs, and Python code. PS. Am I right in assuming that XGen GGML will take a bit of work, given it requires a different tokeniser?
@ikawrakow how about, instead of a compile time constant, assume that the training context length is equal to the value in the config's max_position_embeddings? XGen-7B has max_position_embeddings = 8192.
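A tiny sketch of that idea (the field name comes from the HF config mentioned above; how the value would reach llama.cpp's loader is left open here):

```cpp
// Sketch: derive the training context from the model config when available,
// falling back to LLaMA's 2048. max_position_embeddings is the HF config field
// referenced above; the plumbing into the loader is hypothetical.
static int training_ctx_from_config(int max_position_embeddings) {
    return max_position_embeddings > 0 ? max_position_embeddings : 2048;
}
```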
XGen-7B is a completely different model. Llama and Falcon may become comparable with XGen (or not), and maybe this approach will even help tear down XGen's 8k limit one day. Saying we don't need this PR is somewhat like giving up on Llama because of XGen.
@maddes8cht what ikawrakow says has some merit. By forcing RoPE to be scaled, you potentially lose quality on larger models that can handle longer contexts since everything gets forcibly compressed back down to a 2k window, so it is not an ideal general purpose solution.
Haha, maybe I need to reopen #1967. I closed it in favor of this one. Shouldn't be too hard to adapt it to the pregenerated scale approach and add a few command line options. (I'll look into that tomorrow if I get a chance.) @ggerganov I just want to make sure I understood what you said before: if we were to change RoPE to take a scale array + scale array length option, you'd want to just change …
@LostRuins As Llama DOES have only a 2k trained context, it's still like giving up on Llama entirely. Citing @ikawrakow himself:
Isn't this about applying scaled RoPE to Llama first, maybe to Falcon next, and then, maybe, someday, adapting it so it can be handled flexibly in ggml? That way ggml becomes more easily adaptable to different models and could also cover something like XGen, instead of needing a separate XGen.cpp project, just as we have for Falcon right now.
What if it were a command line argument so the user could decide how to use the model, just like …
The thing is, it's not just a single number you can plug in. People have been playing with different functions for scaling it. So I think something like what I mentioned, where you pass in the scale for each context length, is the most flexible (any algorithm could generate it). Another way might be to pass in a function pointer, kind of like the map functions. From a user perspective, you'd probably need a couple of built-in algorithms that support different approaches to scaling (and it's possible different algorithms would be needed to scale other models like Falcon, etc.). Then the user could select between them and possibly specify extra values if necessary, kind of like how choosing/configuring samplers works.
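A rough sketch of what such an interface might look like; every name below (llama_rope_scale_map, make_linear_scales) is made up for illustration and does not exist in llama.cpp:

```cpp
#include <vector>

// Hypothetical shape of a "scale per position" interface as discussed above.
// The caller pre-generates one scale per token position with whatever algorithm
// it likes (plain linear, NTK-aware, ...) and hands the table to the RoPE op.
struct llama_rope_scale_map {
    std::vector<float> scales;  // scales[p] multiplies token position p
};

// Example generator: plain linear scaling, active only past the training context.
static llama_rope_scale_map make_linear_scales(int n_ctx, int training_ctx) {
    llama_rope_scale_map map;
    map.scales.resize(n_ctx, 1.0f);
    if (n_ctx > training_ctx) {
        const float s = (float) training_ctx / (float) n_ctx;
        for (float & v : map.scales) {
            v = s;
        }
    }
    return map;
}
```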
I investigated minimum VRAM scratch requirements for 7B. It seems that the scratch buffer will need to increase linearly with context size.
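If the growth really is linear in n_ctx, the fixed per-model table in MEM_REQ_SCRATCH0() could in principle become a simple linear formula. A sketch is below; the base and per-token constants are placeholders, not the measured 7B/13B numbers from this thread:

```cpp
#include <cstddef>

// Illustrative only: scratch size as a base amount plus a per-token amount that
// grows linearly with the requested context size. The constants in the usage
// comment are placeholders, not measured requirements.
static size_t scratch_size_linear(size_t base_bytes, size_t bytes_per_token, int n_ctx) {
    return base_bytes + bytes_per_token * (size_t) n_ctx;
}

// Usage sketch: scratch_size_linear(256ull*1024*1024, 128ull*1024, 8192)
// would request 256 MiB + 1 GiB for an 8k context.
```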
I'm on vacation, did not take my laptop with me, and it is a bit tedious to be answering on the phone. When I closed the PR earlier, I was expecting to get to type a comment before it was closed. This shows my age, having such unreasonable expectations of modern UI/UX.

Anyhow, unless my brain has melted in the Mediterranean sun and as a result I did not understand the description, XGen-7B is a drop-in replacement for LLaMA. It offers models trained with 4k and 8k contexts. So, the PR is not good for that. Several alternatives were offered by people commenting, but each would require completely reworking this PR. If I take a command line option as an example, one needs to modify the signatures of several functions along the way.

It is an open source repo, so anyone can reopen and modify accordingly. I'm not doing it in the next 10 days while on vacation without my laptop.
13B numbers:
This PR adds the ability to use context sizes greater than the training context size by applying linear scaling to the token position before applying the RoPE operation. The idea originally came from https://github.com/kaiokendev. See also the discussion #1965. This can be used "out-of-the-box" for context sizes of up to, say, 3072. For even larger contexts one is most likely better off fine tuning the model as discussed in #1965 and elsewhere around the Internet (but perhaps fine tuning starting from this PR would give better results?).
PR #1967 is similar to this PR. The main difference is how the linear scaling is being used. In PR #1967 the positional scaling factor is a constant defined at compile time and therefore affects evaluation also when the context size being used is less than the maximum training context size. This leads to much higher perplexities for context sizes up to 2048 (the training context size of the LLaMA models), e.g., we get ppl = 7 at a context size of 512 instead of the ppl = 5.9066 obtained without scaling.

In this PR we define a compile time constant (LLAMA_TRAINIG_CTX, which in turn defines GGML_TRAINING_CTX) that is the training context size of the model being used. Further, we pass the current context size n_ctx to the RoPE operation when constructing the computational graph. When the graph is evaluated, we use the actual token position when n_ctx <= GGML_TRAINING_CTX, but scale the position with GGML_TRAINING_CTX / n_ctx when n_ctx > GGML_TRAINING_CTX. Interestingly enough, it is better to set GGML_TRAINING_CTX to 2176 (2048 + 128) and not 2048 (better in the sense that we get slightly lower perplexities).

The following table gives a summary of perplexities obtained for the 7B LLaMA model for context sizes of up to 5120. It is currently not possible to run with n_ctx > 5120 at 7B on CUDA (we get NaNs and @JohannesGaessler is looking into the issue). Q6_K quantization is used, which is known to match fp16 perplexity to within 0.1%. Interesting to note that we actually outperform the n_ctx = 2048 result at n_ctx = 2560, i.e., we do get some "free lunch" :-)

The next table gives the results for 13B, where we currently can only go up to about n_ctx = 3540 on CUDA before getting NaNs. Here the n_ctx = 2048 result is outperformed up to 3072 tokens.

For n_ctx <= 3072 linear scaling works best. For n_ctx > 3072 one can get slightly better results by using a modified mapping from the original token position p0 to the position p given to RoPE, involving constants a_scale and b_scale (with this being applied only when n_ctx > GGML_TRAINING_CTX). Using this approach I get ppl = 5.6472 for n_ctx = 4096 and 7B. With the gain being quite modest compared to just linear scaling and the approach more complicated, I did not add this to the PR.