Replies: 17 comments
-
It does, but there are some issues with it:
All in all I'm not keen on taking on the work of making this method fast (somehow, if it's even possible), until I've actually seen it being useful. And it would be a considerable rewrite. Enabling remote code isn't really an option because that remote code is meant to be injected into Transformers.
-
In my experience from merging the landmark LoRA into 13B models, the responses so far have been rather meh, but coherent past 2048. The speed is terrible; a 65B replies faster than a 13B. Currently it's a fast way to stop models from going nuts past 2048: you just merge the LoRA and quantize it to GPTQ. Whatever that merge of MPT StoryWriter and Wizard did was way better.
-
If it's just a fast way to stop models from outputting nonsense beyond 2048 by merging a LoRA, maybe turboderp/alpaca_lora_4bit is just as good, and it's fast with exllama's current approach (I think). I haven't tried inference for the author's method outside the repo itself (turboderp/alpaca_lora_4bit): were special inference changes required, and are they already part of exllama? Would textgen need anything other than simply increasing the context length slider for inference with that method (even if speed/VRAM would suck)? I wish axolotl or the original alpaca_lora_4bit merged the modifications from turboderp/alpaca_lora_4bit. It would make that method more popular, and with the right training data, maybe it actually works?

I eyeballed the loss from turboderp/alpaca_lora_4bit to see if I could get it to go lower with the bigger context while training. The model can learn not to get worse beyond 2048 context, but it doesn't learn to push the loss any lower. I tried feeding it obvious needle-haystack problems like the landmark attention paper does, but it didn't pick up on the correlation in my attempts. In theory, though, I imagine it should with enough data. Or maybe a LoRA isn't good enough for this.
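For context, this is the kind of toy needle-haystack test I mean (my own construction, not the paper's exact setup):

```python
# Toy pass-key retrieval example: bury a random key in a few thousand tokens of
# filler and ask for it back at the end. A model that can't attend past 2048
# tokens will only find it by luck.
import random

def make_passkey_example(filler_sentences=400):
    key = random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. " * filler_sentences
    insert_at = random.randint(0, len(filler))
    prompt = (
        filler[:insert_at]
        + f" The pass key is {key}. Remember it. "
        + filler[insert_at:]
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, str(key)

prompt, answer = make_passkey_example()
```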
Wouldn't that still be better than looking things up in a vector database outside the LLM (e.g., using sentence embeddings) and inserting the relevant text into the LLM context? I think this is what current search methods like superbooga do. At least in the landmark approach they use attention to figure relevance out.
Ah, that sucks.
-
The only change I made for inference in that repo was to use a preallocated cache, only I did it in a very hacky and breakable way to make it work with Transformers. ExLlama uses the same kind of preallocated cache, but more elegantly, so it supports very long contexts as long as the model knows what to do with positional embeddings after position 2048.

As for that, though, I haven't gotten to LoRA support just yet. But if the output of that LoRA repo were merged into the FP16 base model, and the combined model was then quantized again, I don't see why it wouldn't work with ExLlama. I will be adding LoRA support soon anyway; it's not really complicated. It's just that with QLoRA and GPTQ-LoRA coming out, I'm hesitating because I'm not sure what I should support first. It kind of depends on what ends up being most popular. The downside of not building on Transformers is losing that plug-and-play compatibility with every new thing that comes out, so I have to prioritize these things.

Next on the list is multi-GPU matmuls, which might give a big boost to 65B models on dual GPUs (fingers crossed). I'd like to get to 30 tokens/second at least.
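To be concrete about the merge route mentioned above, a minimal sketch of folding a LoRA into the FP16 base with peft (hypothetical paths, and assuming the adapter was trained through peft; GPTQ quantization of the merged checkpoint is a separate pass afterwards, and this isn't ExLlama code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the FP16 base model and attach the long-context LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-13b-fp16", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "path/to/long-context-lora")

# Fold the LoRA deltas into the base weight matrices, then save the merged
# FP16 model so it can be quantized to GPTQ as a normal checkpoint.
model = model.merge_and_unload()
model.save_pretrained("path/to/llama-13b-merged-fp16")
AutoTokenizer.from_pretrained("path/to/llama-13b-fp16").save_pretrained(
    "path/to/llama-13b-merged-fp16"
)
```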
Yep. But I'm not sure if that changes eventually. It might simply come in phases, i.e. it might be easy to learn to ignore the distant past because the contribution from those past tokens is so disruptive, and then take longer for the much more subtle positive contribution to start showing. It's really hard to say for certain, or maybe my understanding of ML just isn't deep enough, but it wouldn't surprise me if it would take more training than what originally went into the base model. After all, comparing 7B to 33B it's very clear that 33B is much better at long-range relationships even within the 2k context that both models work with. That could be because it has more layers and there's a convolutional aspect to how it "scans" the context. Or long-range relationships are inherently more complex and require more "circuitry" to comprehend.
Well, attention is still ultimately just a dot-product similarity search. I can see how using the same conditions for the search as are used for token-to-token attention could result in more relevant matches, but that still only makes it a more accurate vector database. I do hope to see some promising results from it, though. There might be ways to speed it up, if it's actually a good method. I would like to see comparisons of perplexity vs context length, for instance. It should show predictions getting better the more context it has for any given example, without plateauing at 2048 tokens.
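Something along these lines (a rough sketch with hypothetical paths, not ExLlama's actual evaluator) would give that perplexity-vs-context curve: score the same text while only letting the model see the last `ctx` tokens, and check whether the loss keeps dropping past 2048 or flattens out.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model", torch_dtype=torch.float16
).cuda().eval()

# One long document; in practice you'd average over many samples.
ids = tok(open("long_sample.txt").read(), return_tensors="pt").input_ids[0].cuda()

for ctx in (512, 1024, 2048, 4096, 8192):
    window = ids[-ctx:].unsqueeze(0)              # last `ctx` tokens only
    with torch.no_grad():
        loss = model(window, labels=window).loss  # HF shifts labels internally
    print(ctx, "ppl:", round(math.exp(loss.item()), 3))
```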
-
That's the main reason I use your repo right now BTW, so very much looking forward to it! All other inference methods I've seen so far suck when you start splitting.
I want to keep trying to train it. I'll try landmark attention for reference too. From what you said, I should avoid the temptation to truncate and compare at less than 2048 context. The true test is crossing the limit of the normal positional encoding.
IIRC 33B and up were also trained on 40% more data than 13B and below. Maybe there is generally poor long-range context info in normal internet text, so it needs both quantity and compute to make use of it. I would imagine cherry-picking long conversations from ShareGPT/Vicuna would guarantee some patterns beyond 2048 (but less than 4096 if using ChatGPT as a base), at least as a start. Maybe there is some good way to curate data with long-range context (other than deliberate needle-haystack). Anyway, I think I have the answer for supporting landmark attention in this repo, at least for now. Thanks.
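On the curation point above, one cheap filter (a sketch; the "conversations"/"value" field names are just the common ShareGPT JSON layout and the paths are hypothetical) is to keep only conversations that don't fit in 2048 tokens in the first place:

```python
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama-tokenizer")

data = json.load(open("sharegpt.json"))

# Keep conversations whose total token count exceeds the 2048 window, so every
# training example is forced to carry some long-range context.
long_convos = [
    conv for conv in data
    if sum(len(tok(turn["value"]).input_ids) for turn in conv["conversations"]) > 2048
]
json.dump(long_convos, open("sharegpt_long.json", "w"))
```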
-
My thinking is that any given text will contain more short-range information than long-range information. This is why, if you start training a transformer from scratch and periodically check in on what it's able to generate as it's training, you'll see that it starts by learning how to arrange tokens in pairs or triplets to form correct words. Then it starts stringing related words together but without really saying anything. Eventually you start to see full sentences, and then later on sequences of sentences that build on one another. It's a clear progression from short-range to long-range attention.

One thought I had would be to force it to do long-range attention by only giving it long-range information. You could just tweak the position IDs to spread the examples out. This would speed up training too, since you could pass 2048 tokens but they would look to the model like, say, the beginning and end of an 8192-token sequence, with nothing in the middle for it to get confused by. I'm thinking maybe you start by adding a little space randomly between sentences, and then gradually increase it. And maybe eventually you switch to full-length examples. Idk. It's definitely on my list of things to try.
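A toy version of that position-ID spreading idea (everything here, including the gap sizes and splitting on a period token, is just an illustrative assumption):

```python
import torch

def spread_position_ids(input_ids, period_token_id, max_gap=64):
    """Return position IDs with a random gap inserted after each 'sentence',
    so 2048 real tokens look like fragments of a much longer sequence."""
    pos = torch.zeros_like(input_ids)
    cur = 0
    for i, tok_id in enumerate(input_ids.tolist()):
        pos[i] = cur
        cur += 1
        if tok_id == period_token_id:            # end of a "sentence"
            cur += int(torch.randint(0, max_gap + 1, (1,)))
    return pos

# Usage sketch: pass the stretched positions alongside the normal inputs.
# position_ids = spread_position_ids(ids, period_token_id=tok.convert_tokens_to_ids("."))
# model(input_ids=ids.unsqueeze(0), position_ids=position_ids.unsqueeze(0), labels=ids.unsqueeze(0))
```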
-
That's a neat idea. I was thinking of inserting random text of varying lengths after the actual query (for instruct-tuning), or randomizing the order of all conversation items before the most recent one in a conversation chain. But if we can just offset the positional data somehow, that's way cooler. It's on-the-fly data augmentation for context training.
-
I have not tried to train landmark attention with a normal 4-bit LoRA (e.g. alpaca_4bit). It would certainly go faster, but it would train only 2 layers vs. all of them. Do all the layers need to be trained with those landmark tokens? Would it be faster because fewer layers are modified, or would it just not work? Merging adapters to all layers certainly makes inference slower. So I'm not sure what happens if you only train the 2 at 4-bit and then merge it into an FP16 model.
-
I almost always train all layers if I can. For landmark, try and see? I think the 2 default layers in alpaca_4bit are where the attention to landmark tokens happens, though I could be wrong. alpaca_4bit does a few odd things like defaulting to a short context, training only 2 layers by default, and not using a rolling context (which further discourages long-context learning imo), but you can revert all that without too much effort.
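For reference, the "2 layers vs. all of them" difference is basically just the target_modules list in a peft LoraConfig (a sketch with illustrative hyperparameters, not alpaca_4bit's exact code; the Alpaca-style default is the q/v projections):

```python
from peft import LoraConfig

# Alpaca-style default: only the query and value projections get adapters.
alpaca_style = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# "All layers": every linear in attention and the MLP gets an adapter.
all_linear = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```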
Is that a landmark-related statement? I guess I normally always merge to FP16 and requantize, and I don't notice a performance difference (non-landmark). Edit: There is a 33B finetune on the Red Pajama dataset plus landmark now, should give it a try. Though it's only 200 iters so far.
-
It just follows the Stanford Alpaca paper. The authors chose those two layers after some experimentation, I believe, determining that they were by far the most important for the results they were trying to achieve. And the point was partially to show how few extra parameters they had to train (about 10 million for 7B as I recall) to turn Llama into a pretty capable instruction-following model. The short context length I think was also suitable for the short examples in the original Alpaca question/answer set. But of course it needs to be adjusted to fit whatever dataset is being applied. And especially if it's for a model that's meant to do a back-and-forth chat with a user it really needs to use the full context length.
I guess that refers to which layers of each decoder you attach adapters to (i.e. doing the adapter computation at runtime). If you merge them into the base model's weights, there's no penalty for adapting more layers.
-
It has always bugged me that even in repos designed for ChatGPT/Vicuna data, they chunk conversations into 2048-token chunks and train on each chunk (or on the assistant responses within that chunk). That means the responses closer to the top of the chunk get less context, which is not ideal if there was indeed prior conversation to go on. That's what I meant by a "rolling" window: I'm always trying to train/compute loss only over the last 500 tokens or so of a 2048 chunk (the latest response). Of course, this really increases training time. For all I know it's placebo; I haven't done A/B tests.
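Concretely, the masking part is just setting the labels of everything but the tail to -100, which HF's loss ignores (a minimal sketch, assuming a ~500-token tail and Transformers-style training):

```python
import torch

def mask_all_but_tail(input_ids, tail_len=500, ignore_index=-100):
    """Compute loss only on the last `tail_len` tokens of the window."""
    labels = input_ids.clone()
    labels[..., :-tail_len] = ignore_index
    return labels

# Rolling-window sketch: slide the window forward so each response is scored
# with as much preceding conversation as fits in the 2048-token context.
# window = ids[start : start + 2048]
# loss = model(window.unsqueeze(0), labels=mask_all_but_tail(window).unsqueeze(0)).loss
```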
-
I don't know if that makes sense. It might. Rotary positional encodings are supposed to generalize better than the sinusoidal encodings from vanilla transformers, but they're still computed from the absolute position in the sequence. And they apparently don't generalize beyond the 2048 tokens of the base model without tuning. So I'd be worried about the model forgetting what to do with positions 0-1500 if it only ever sees 1500-2000 in finetuning. Then again... idk.
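For the curious, a quick sketch of why that is: the rotation angles in RoPE are computed directly from the absolute position index, so positions the model never saw during training get angles it never learned to handle.

```python
import torch

def rope_angles(position, dim=128, base=10000.0):
    """Per-channel-pair rotation angles for one absolute position (standard RoPE)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return position * inv_freq

print(rope_angles(10)[:4])     # position well inside the 0-2047 training range
print(rope_angles(4000)[:4])   # far beyond it: angles the base model never trained on
```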
-
Well, I'd argue that in a long conversation you are only generating the last bit of context most of the time anyway (at least for current context sizes, which don't exceed conversation lengths), so maybe it makes sense to have a specialized LoRA. You can throw in Alpaca-style LoRA/data for training the front end. Of course, it's all speculation whether it even matters.
-
Well, if there is no penalty after merging, then landmark is just very, very slow no matter what. I have just been merging the LoRA that was released into other Llama models, and it has worked for the most part. I haven't narrowed down what exactly has to be swapped over besides config.json and the scripts; for Llama I used the entire set of tokenizer/config files. The motivation hasn't been great, because the replies were short despite the model handling the extra context, and the speed was slow.
-
Well, I mean, here's the code for the modified attention function. It's dense and would take a while for me to parse, but it's clearly doing a lot of extra work on the context. From the paper I also can't see how they could get around re-evaluating the entire context for every token generated, since the position embeddings have to change depending on which blocks receive attention at any given moment. And if that weren't necessary, i.e. if the base model understood position embeddings spanning the whole sequence, then landmark attention wouldn't really be needed in the first place. So it just seems like a really slow method. They've added a Triton attention kernel that might speed things up a bit, not sure if you're using that. But I wouldn't expect miracles from that either.
-
Triton was super hyped up, but every time I've tried it, results have been worse than the normal CUDA kernels. Plus they don't support pre-Volta, and that knocks one of my systems out from the start. Makes perfect sense though: if they reprocess everything every time, not even dual 3090s can be fast with that. I think landmark attention is a good try, but it's not going to be more than a summarizer.
-
I'm curious to see if Landmark can beat standard vector retrieval like superbooga, etc., i.e., how good the joint attention-led retrieval is: basically a perplexity vs. context length plot comparing both methods, like turboderp said, for various tasks. Otherwise, what's the point? It's just a slower way. Actually, now I want to make that same plot for the ShareGPT dataset as a reference. Presumably those conversations were made with ChatGPT Turbo 4K, so we could see what potential there is in that dataset in the first place. The cost has come down, and the new 16K Turbo model is also pretty cheap: it should be possible to automatically generate some story-writing evaluation datasets to test contexts up to 16K using ChatGPT (at least, to the extent that ChatGPT is itself trained to exploit it).
-
Any thoughts on how difficult it would be to support inference on a model trained with landmark attention? Like Minotaur, Wizard or the base Llama landmark finetunes released recently, and I suppose more will come out, now that multiple repos support lora/qlora/gptq-lora training with landmark attention.
I haven’t compared results yet, but it sounds like landmark attention should be more effective with long contexts compared to the turboderp/alpaca_lora_4bit repo. Like the author, I found that that repo did “something”, and stopped generating gibberish beyond 2048 at least, but I’m not sure what the model learned. The landmark attention paper claims it can solve needle-haystack problems beyond the context length, which I couldn’t get the previous method to do.
Landmark apparently works in Oobabooga with remote code enabled.