
Modified RoPE with linear scaling #2019

Closed · wants to merge 2 commits

Conversation

@ikawrakow (Contributor)

This PR adds the ability to use context sizes greater than the training context size by applying linear scaling to the token position before applying the RoPE operation. The idea originally came from kaiokendev (https://github.com/kaiokendev). See also the discussion in #1965. This can be used out of the box for context sizes up to, say, 3072. For even larger contexts one is most likely better off fine-tuning the model, as discussed in #1965 and elsewhere around the Internet (but perhaps fine-tuning starting from this PR would give better results?).

PR #1967 is similar to this PR. The main difference is how the linear scaling is applied. In PR #1967 the positional scaling factor is a constant defined at compile time and therefore also affects evaluation when the context size being used is less than the maximum training context size. This leads to much higher perplexities for context sizes up to 2048 (the training context size of the LLaMA models); e.g., we get ppl = 7 at a context size of 512 instead of the ppl = 5.9066 obtained without scaling. In this PR we define a compile-time constant (LLAMA_TRAINIG_CTX, which in turn defines GGML_TRAINING_CTX) that is the training context size of the model being used. Further, we pass the current context size n_ctx to the RoPE operation when constructing the computational graph. When the graph is evaluated, we use the actual token position when n_ctx <= GGML_TRAINING_CTX, but scale the position with GGML_TRAINING_CTX / n_ctx when n_ctx > GGML_TRAINING_CTX.

Interestingly enough, it is better to set GGML_TRAINING_CTX to 2176 (2048 + 128) and not 2048 (better in the sense that we get slightly lower perplexities).
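
For illustration, here is a minimal sketch of the scaling rule described above. This is not the actual patch; the helper name `rope_position` is made up for this example, and GGML_TRAINING_CTX stands for the compile-time training context (2176 by default in this PR):

```cpp
#ifndef GGML_TRAINING_CTX
#define GGML_TRAINING_CTX 2176   // training context size; this PR's default is 2176
#endif

// Hypothetical helper illustrating the rule: positions are untouched within the
// training window, and compressed linearly by GGML_TRAINING_CTX / n_ctx beyond it.
static float rope_position(int p0, int n_ctx) {
    if (n_ctx <= GGML_TRAINING_CTX) {
        return (float) p0;                                            // identical to plain RoPE
    }
    return (float) p0 * ((float) GGML_TRAINING_CTX / (float) n_ctx);  // linear scaling
}
```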

The following table gives a summary of perplexities obtained with the 7B LLaMA model for context sizes up to 5120. It is currently not possible to run with n_ctx > 5120 for 7B on CUDA (we get NaNs, and @JohannesGaessler is looking into the issue). Q6_K quantization is used, which is known to match fp16 perplexity to within 0.1%.

| n_ctx | Perplexity |
|------:|-----------:|
| 1024  | 5.4351     |
| 2048  | 5.2855     |
| 2560  | 5.2232     |
| 3072  | 5.3003     |
| 3584  | 5.4725     |
| 4096  | 5.6743     |
| 4608  | 5.8487     |
| 5120  | 6.1153     |

It is interesting to note that we actually outperform the n_ctx = 2048 result at n_ctx = 2560, i.e., we do get some "free lunch" :-)

The next table gives the results for 13B, where we can currently only go up to about n_ctx = 3540 on CUDA before getting NaNs:

| n_ctx | Perplexity |
|------:|-----------:|
| 2048  | 4.7094     |
| 2560  | 4.6459     |
| 3072  | 4.6868     |
| 3584  | 4.8135     |

Here the n_ctx = 2048 result is outperformed up to 3072 tokens.

For n_ctx <= 3072, linear scaling works best. For n_ctx > 3072, one can get slightly better results by using, e.g.,

p = p0 * a_scale / (1 + p0 / b_scale)

where p0 is the original token position and p is the token position given to RoPE. The constants a_scale and b_scale that seem to work best are given by

a_scale = sqrtf(GGML_TRAINING_CTX / n_ctx)
b_scale = a_scale * n_ctx / (1 - a_scale)

(with this being applied only when n_ctx > GGML_TRAINING_CTX). Using this approach I get ppl = 5.6472 for n_ctx = 4096 with 7B. Since the gain is quite modest compared to plain linear scaling and the approach is more complicated, I did not add it to the PR.
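
For reference, a sketch of this alternative (the helper name is invented for the example; as noted, it would only be applied when n_ctx > GGML_TRAINING_CTX):

```cpp
#include <cmath>

#ifndef GGML_TRAINING_CTX
#define GGML_TRAINING_CTX 2176
#endif

// Non-linear variant described above; only meaningful for n_ctx > GGML_TRAINING_CTX,
// where a_scale < 1 and hence b_scale > 0.
static float rope_position_nonlinear(int p0, int n_ctx) {
    const float a_scale = sqrtf((float) GGML_TRAINING_CTX / (float) n_ctx);
    const float b_scale = a_scale * (float) n_ctx / (1.0f - a_scale);
    return (float) p0 * a_scale / (1.0f + (float) p0 / b_scale);
}
```

With GGML_TRAINING_CTX = 2176 and n_ctx = 4096 this gives a_scale ≈ 0.73, so early positions are compressed less aggressively than with the plain linear factor of ≈ 0.53, while the largest position still maps back into the trained range (plugging p0 = n_ctx into the formula gives exactly n_ctx · a_scale² = GGML_TRAINING_CTX).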

When the context size is greater than the maximum context size
during training, scale the position given to RoPE with
training context / n_ctx.
@ikawrakow requested a review from ggerganov on June 27, 2023 at 13:20
@FNsi (Contributor) commented Jun 27, 2023

Did you try with a higher context in a larger model (for example, 16k with 13B)?

@ikawrakow (Contributor Author) commented Jun 27, 2023

> Did you try with higher context in 13b like 16384?

No, because, as mentioned in the PR, the CUDA implementation in llama.cpp currently does not work beyond ~3.5k context for 13B, and I don't have the patience to run it on the CPU. We will try to refine once the CUDA issue has been fixed.

@slaren (Member) commented Jun 27, 2023

If you are looking for numerical stability issues in the CUDA implementation, I would suggest looking at the soft max kernel.

@ikawrakow (Contributor Author)

> If you are looking for numerical stability issues in the CUDA implementation, I would suggest looking at the soft max kernel.

It is not a matter of numerical stability. It is a matter of the buffer sizes not being estimated correctly for large context sizes, so data in the scratch buffers gets overwritten (or not written at the right place). The incorrect scratch buffer sizes also affect the CPU code; it is just that this happens somewhat later (at larger context sizes) and llama.cpp stops with an assert instead of producing NaNs. E.g., if I run a 7B model with

./bin/perplexity -m q6k.bin -f ../tests/wikitext-2-raw/wiki.test.raw -s 1234 -t 16 -c 8192

I get

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 545259520, available 536870912)

When the code was developed, context sizes greater than 2048 weren't a possibility, so to some extent it is not surprising that it does not work when the assumption n_ctx <= 2048 no longer holds.
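
For reference, the numbers in that assert line up with the fixed 512 MiB scratch0 size used for 7B (the same value that appears in the diff further down); a quick check, assuming nothing beyond the reported figures:

```cpp
#include <cstdio>

int main() {
    const unsigned long long MB        = 1024ULL * 1024ULL;
    const unsigned long long available = 512ULL * MB;   // scratch0 size for MODEL_7B at the time
    const unsigned long long needed    = 545259520ULL;  // from the assert above (-c 8192)

    // prints: needed 520 MiB vs available 512 MiB -> the pool is 8 MiB short
    printf("needed %llu MiB vs available %llu MiB\n", needed / MB, available / MB);
    return 0;
}
```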

@slaren (Member) commented Jun 27, 2023

That will be easier to fix then. I know that the soft max kernel produces infs and NaNs in F16, so it seemed to be the same issue. The way soft max is currently implemented is prone to numerical stability issues and should be fixed eventually.

@FNsi (Contributor) commented Jun 27, 2023

I built #1967 with OpenBLAS and got garbage results with a fine-tuned 16k model.

Might this give you some ideas for the CUDA version?

@jxy (Contributor) commented Jun 27, 2023

The calculation of the scratch buffer size is unclear. The following works for an 8k context.

diff --git a/llama.cpp b/llama.cpp
index 1a15844..eb112a2 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -83,9 +83,9 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
     static std::map<e_model, size_t> k_sizes = {
         { MODEL_3B,    256ull * MB },
         { MODEL_7B,    512ull * MB },
-        { MODEL_13B,   512ull * MB },
-        { MODEL_30B,   512ull * MB },
-        { MODEL_65B,  1024ull * MB },
+        { MODEL_13B,  1024ull * MB },
+        { MODEL_30B,  1024ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return k_sizes;
 }
@@ -715,7 +715,7 @@ struct llama_model_loader {
         *ctx_size_p = *mmapped_size_p = 0;
         for (const llama_load_tensor & lt : tensors_map.tensors) {
             *ctx_size_p += sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE;
-            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size;
+            *(use_mmap ? mmapped_size_p : ctx_size_p) += lt.size + 16;
         }
     }
 

@SlyEcho (Collaborator) commented Jun 27, 2023

What is the effect on models other than LLaMA?

@ikawrakow (Contributor Author)

> What is the effect on models other than LLaMA?

If your usage stays within their respective training context sizes, as it has been until now: none. Otherwise, you try and see what happens.

@SlyEcho added labels: enhancement (New feature or request), generation quality (Quality of model output) on Jun 27, 2023
@Midaychi

It's possible that if a base LLaMA or OpenLLaMA model were completely fine-tuned using this RoPE method (rather than just with the SuperHOT LoRA), it might achieve better overall results and could possibly have no context ceiling beyond the realistic limitations of memory.

@@ -72,6 +72,7 @@ set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kern
set(LLAMA_CUDA_DMMV_Y "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
option(LLAMA_CUDA_DMMV_F16 "llama: use 16 bit floats for dmmv CUDA kernels" OFF)
set(LLAMA_CUDA_KQUANTS_ITER "2" CACHE STRING "llama: iters./thread per block for Q2_K/Q6_K")
set(LLAMA_TRAINIG_CTX "2176" CACHE STRING "llama: model training maximum context")
Review comment (Collaborator):

Should this be LLAMA_TRAINING_CTX?

@KerfuffleV2 (Collaborator)

@ikawrakow What do you think of this idea? #1965 (comment)

Pass in an array of specific scales with an entry for each possible context length. It could even use the existing options subtensor for storage.

This would allow trying out different approaches to scaling. Other models that use RoPE might also benefit from this sort of thing in the future, but require a different scale than LLaMA.
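
One possible reading of that idea, sketched with made-up names (how the table would actually be stored, e.g. in the opt tensor, is left open):

```cpp
#include <vector>

// Hypothetical: precompute one positional scale per candidate context length so that
// any generator (linear, non-linear, NTK-aware, model-specific, ...) can fill the table.
std::vector<float> make_rope_scale_table(int n_train_ctx, int max_ctx, int step = 512) {
    std::vector<float> scales;
    for (int n_ctx = step; n_ctx <= max_ctx; n_ctx += step) {
        // linear scaling used here purely as the example generator
        scales.push_back(n_ctx <= n_train_ctx ? 1.0f : (float) n_train_ctx / (float) n_ctx);
    }
    return scales; // handed to the RoPE op instead of a compile-time constant
}
```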

@ikawrakow (Contributor Author)

Sounds interesting. But isn't it better to first merge this and then try to refine? Btw, I have gone on vacation and will not be very responsive in the next 2 weeks.

@JianbangZ

Once this PR is merged, how do I use it? Do I need to requantize the model, or just rebuild llama.cpp and set n_ctx on the fly?

@ggerganov (Member)

I don't think we need the new defines. We should pass all parameters from the user code through the opt tensor of ggml_rope(). We can also utilize the mode parameter to support both the original interpolation method from #1967 and the one proposed here (and potentially new ones proposed in the future)

@ikawrakow (Contributor Author)

@ggerganov The main value of this PR is not the 20-30 lines code change, which was a matter of 5 minutes of effort. The PR shows how to correctly apply position scaling, which was missed by everybody, yourself included. Finding how to do it better compared to what was out there was where the 2-3 days effort went. Once this was done, it was clear that one needs the model training context size, which is a model hyper parameter that has been completely missed so far. Instead of embarking on a much bigger change to add this new hyper parameter to ggml/llama.cpp, I took the pragmatic approach of defining it via a preprocessor macro. This seemed appropriate, considering that the number of training context windows of models using RoPE and supported by llama.cpp is exactly one.

Concerning using mode: yes, I considered that, but decided not to add a new mode. Why? Very simple: what is the point of keeping mode 0 unmodified and adding a new mode, given that a) mode 0 is identical to the new mode for contexts less than or equal to the training context size, and b) mode 0 completely falls apart for contexts greater than the training context window? Basically, you can see this PR as fixing mode 0 by letting it extend in a reasonable way where it failed completely before, while keeping its behavior unmodified where the original version works.

@ikawrakow (Contributor Author)

> Once this PR is merged, how do I use it? Do I need to requantize the model, or just rebuild llama.cpp and set n_ctx on the fly?

No, no need to requantize. Just specify the context window to use via -c on the command line.

@Midaychi

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
This was posted not too long ago; is this helpful?

@ikawrakow closed this Jun 29, 2023
@LostRuins (Collaborator)

Why close this PR?

@TheBloke (Contributor) commented Jun 29, 2023

I was just about to start uploading dozens of SuperHOT GGML repos.

KoboldCpp just added support for RoPE, based on this PR, so I was going to start uploading on that basis. But my READMEs were also going to link to this PR and tell users the feature was coming soon to llama.cpp.

It'd be great to know what's happening with this PR and why it's been closed?

@Nexesenex (Contributor)

Well, it works or it doesn't.
Upload at least one model, Tom, please, so we can test it on the latest KoboldCpp!

@TheBloke (Contributor)

Don't worry, I will.

But I'd like to know what to tell people about upcoming llama.cpp support. I thought it was imminent, but if it's not, I'd like to tell people that it'll only work in KoboldCpp until further notice.

@maddes8cht (Contributor) commented Jun 29, 2023

Ouch, this is sad.

Will there be a new improved PR?
What's going on?

Some bad communication?

I've been watching this thread since its creation and was eagerly waiting for it to be merged.

@ikawrakow (Contributor Author)

In light of today's XGen-7B announcement, having the training context length as a compile time option is not a good approach. That's why I closed this PR. A better approach will hopefully come soon.

@TheBloke (Contributor)

Ah OK.

And you couldn't merge this now and then improve the implementation as a separate PR?

That would allow people to try the SuperHOT GGMLs now, which many people would love to do. It's great that KoboldCpp supports it, but llama.cpp support will then also bring llama-cpp-python and ctransformers, opening it up to several more UIs, and Python code.

PS. Am I right in assuming that XGen GGML will take a bit of work, given it requires a different tokeniser?

@LostRuins (Collaborator)

@ikawrakow how about, instead of a compile-time constant, assuming that the training ctxlen is equal to the value in the config's max_position_embeddings, which GGML can read? Is that a valid assumption?

- XGen-7B has max_position_embeddings = 8192
- A typical LLaMA model has max_position_embeddings = 2048
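
If that assumption holds, the compile-time constant could indeed become a per-model value picked up at load time; a rough sketch with illustrative names (not the actual llama.cpp hparams struct):

```cpp
// Sketch only: assume the converter stores max_position_embeddings in the GGML file
// as a hyperparameter, called n_ctx_train here for the example.
struct example_hparams {
    int n_ctx_train; // 2048 for a typical LLaMA model, 8192 for XGen-7B
};

static float rope_scale_for(const example_hparams & hp, int n_ctx) {
    // no scaling while the requested context stays within the trained window
    return n_ctx <= hp.n_ctx_train ? 1.0f : (float) hp.n_ctx_train / (float) n_ctx;
}
```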

@maddes8cht (Contributor) commented Jun 29, 2023

> In light of today's XGen-7B announcement, having the training context length as a compile time option is not a good approach. That's why I closed this PR. A better approach will hopefully come soon.

XGen-7B is a completely different model.
This doesn't make any sense at all.
It's like proposing we don't need LLaMA anymore because of XGen.
We didn't say that with the release of Falcon, so why now?

LLaMA and Falcon may become comparable with XGen (or not), and maybe this approach will even help tear down XGen's 8k limit one day.

Saying we don't need this PR is somehow like giving up on LLaMA because of XGen.

@LostRuins (Collaborator)

@maddes8cht what ikawrakow says has some merit. By forcing RoPE to be scaled, you potentially lose quality on larger models that can handle longer contexts, since everything gets forcibly compressed back down to a 2k window, so it is not an ideal general-purpose solution.

@KerfuffleV2 (Collaborator)

Haha, maybe I need to reopen #1967. I closed it in favor of this one. Shouldn't be too hard to adapt it to the pregenerated scale approach and add a few commandline options. (I'll look into that tomorrow if I get a chance.)

@ggerganov I just want to make sure I understood what you said before: If we were to change RoPE to take a scale array + scale array length options you'd want to just change ggml_rope and ggml_rope_inplace to take new arguments instead of creating a new ggml_rope_with_scale or something - even though this change would break all applications that use GGML?

@maddes8cht (Contributor)

@LostRuins Since LLaMA DOES have only a 2k trained context, it's still like giving up on LLaMA altogether.

Quoting @ikawrakow himself:

> The main value of this PR is not the 20-30 lines code change, which was a matter of 5 minutes of effort. The PR shows how to correctly apply position scaling, which was missed by everybody, yourself included. Finding how to do it better compared to what was out there was where the 2-3 days effort went. Once this was done, it was clear that one needs the model training context size, which is a model hyper parameter that has been completely missed so far. Instead of embarking on a much bigger change to add this new hyper parameter to ggml/llama.cpp, I took the pragmatic approach of defining it via a preprocessor macro. This seemed appropriate, considering that the number of training context windows of models using RoPE and supported by llama.cpp is exactly one.

Isn't this about applying scaled RoPE to LLaMA first, maybe to Falcon next, and then, maybe, at some point adapting it so it can be handled flexibly in ggml, so that ggml is more easily adaptable to different models and can also cover something like XGen, instead of having a separate XGen.cpp project just as we have for Falcon right now?

@SlyEcho (Collaborator) commented Jun 29, 2023

What if it were a command line argument so the user could decide how to use the model, just like n_ctx is today?

@KerfuffleV2 (Collaborator)

> What if it were a command line argument so the user could decide how to use the model, just like n_ctx is today?

The thing is, it's not just a single number you can plug in. People have been playing with different functions for scaling it. So I think something like what I mentioned, where you pass in the scale for each context length, is the most flexible (any algorithm could generate it). Another way might be to pass in a function pointer, kind of like the map functions.

From a user perspective, you'd probably need to have a couple of built-in algorithms that could support different approaches to scaling (and it's possible different algorithms would be needed to scale other models like Falcon, etc.). Then the user could select between them and possibly specify extra values if necessary, kind of like how choosing/configuring samplers works.
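
As a hedged sketch of what such built-in algorithms might look like if selectable by the user (the enum and function names are invented for this example):

```cpp
#include <cmath>

enum rope_scaling_type { ROPE_SCALING_NONE, ROPE_SCALING_LINEAR, ROPE_SCALING_NONLINEAR };

// Dispatch between a couple of scaling approaches; extra per-algorithm constants could
// be added the same way sampler options are configured today.
static float rope_scaled_position(int p0, int n_ctx, int n_train_ctx, rope_scaling_type type) {
    if (type == ROPE_SCALING_NONE || n_ctx <= n_train_ctx) {
        return (float) p0;
    }
    if (type == ROPE_SCALING_LINEAR) {
        return (float) p0 * (float) n_train_ctx / (float) n_ctx;
    }
    // ROPE_SCALING_NONLINEAR: the variant discussed earlier in the thread
    const float a = sqrtf((float) n_train_ctx / (float) n_ctx);
    const float b = a * (float) n_ctx / (1.0f - a);
    return (float) p0 * a / (1.0f + (float) p0 / b);
}
```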

@JohannesGaessler (Collaborator)

I investigated minimum VRAM scratch requirements for 7b. It seems that the scratch buffer will need to increase linearly with context size:

| Model   | Context size | Min. VRAM scratch size [MiB] | Delta [MiB] |
|---------|-------------:|-----------------------------:|------------:|
| 7b q6_K | 512  | 136 | -  |
| 7b q6_K | 1024 | 171 | 35 |
| 7b q6_K | 1536 | 243 | 72 |
| 7b q6_K | 2048 | 318 | 75 |
| 7b q6_K | 2560 | 350 | 32 |
| 7b q6_K | 3072 | 382 | 32 |
| 7b q6_K | 3584 | 414 | 32 |
| 7b q6_K | 4096 | 446 | 32 |
| 7b q6_K | 4608 | 478 | 32 |
| 7b q6_K | 5120 | 510 | 32 |
| 7b q6_K | 5632 | 542 | 32 |
| 7b q6_K | 6144 | 574 | 32 |
| 7b q6_K | 6656 | 606 | 32 |
| 7b q6_K | 7168 | 638 | 32 |
| 7b q6_K | 7680 | 670 | 32 |
| 7b q6_K | 8192 | 702 | 32 |
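
From these measurements, a rough rule of thumb for the 7B q6_K case is 318 MiB at n_ctx = 2048 plus about 32 MiB per additional 512 tokens; a linear extrapolation of the table above, not an exact formula:

```cpp
// Rough extrapolation of the 7b q6_K measurements above.
static int estimate_scratch_mib_7b(int n_ctx) {
    if (n_ctx <= 2048) {
        return 318;                          // measured value at 2048 used as an upper bound
    }
    return 318 + 32 * (n_ctx - 2048) / 512;  // +32 MiB per extra 512 tokens
}
// e.g. estimate_scratch_mib_7b(8192) == 318 + 32*12 == 702, matching the last row.
```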

@ikawrakow (Contributor Author)

I'm on vacation, did not take my laptop with me, and it is a bit tedious to answer on the phone. When I closed the PR earlier, I was expecting to get to type a comment before it was closed. This shows my age, having such unreasonable expectations of modern UI/UX. Anyhow, unless my brain has melted in the Mediterranean sun and I did not understand the description as a result, XGen-7B is a drop-in replacement for LLaMA. It offers models trained with 4k and 8k contexts, so the PR is not good for that. Several alternatives were offered by people commenting, but each requires completely modifying this PR. If I take a command line option as an example, one needs to modify the signatures of ggml_rope and ggml_rope_inplace to pass the parameter, and then one needs to pass it to the actual compute kernel via opt, as @ggerganov suggested earlier. So, basically a very different PR.

It is an open source repo, so anyone can reopen and modify accordingly. I'm not doing it in the next 10 days while on vacation without my laptop.

@JohannesGaessler (Collaborator)

13b numbers:

| Model    | Context size | Min. VRAM scratch size [MiB] | Delta [MiB] |
|----------|-------------:|-----------------------------:|------------:|
| 13b q4_0 | 512  | 170 | -   |
| 13b q4_0 | 1024 | 214 | 44  |
| 13b q4_0 | 1536 | 251 | 37  |
| 13b q4_0 | 2048 | 398 | 147 |
| 13b q4_0 | 2560 | 438 | 40  |
| 13b q4_0 | 3072 | 478 | 40  |
| 13b q4_0 | 3584 | 518 | 40  |
| 13b q4_0 | 4096 | 558 | 40  |
| 13b q4_0 | 4608 | 598 | 40  |
| 13b q4_0 | 5120 | 638 | 40  |
| 13b q4_0 | 5632 | 678 | 40  |
| 13b q4_0 | 6144 | 718 | 40  |
| 13b q4_0 | 6656 | 758 | 40  |
| 13b q4_0 | 7168 | 798 | 40  |
| 13b q4_0 | 7680 | 838 | 40  |
| 13b q4_0 | 8192 | 878 | 40  |
