Replies: 17 comments
-
It does, but there are some issues with it:
All in all I'm not keen on taking on the work of making this method fast (somehow, if it's even possible), until I've actually seen it being useful. And it would be a considerable rewrite. Enabling remote code isn't really an option because that remote code is meant to be injected into Transformers.
-
In my experience from merging the landmark LoRA into 13B models, the responses so far have been rather meh, but coherent past 2048. The speed is terrible; a 65B replies faster than a 13B. Currently it's a fast way to stop models from going nuts past 2048: you just merge the LoRA and quantize it to GPTQ. Whatever that merge of MPT StoryWriter and Wizard did was way better.
-
If it's just a fast way to stop models from outputting nonsense beyond 2048 by merging a LoRA, maybe turboderp/alpaca_lora_4bit is just as good, and it's fast with exllama's current approach (I think). I haven't tried inference for the author's method outside the repo itself (turboderp/alpaca_lora_4bit): were special inference changes required, and are they already part of exllama? Would textgen need anything other than simply increasing the context length slider for inference with that method (even if speed/VRAM would suck)? I wish axolotl or the original alpaca_lora_4bit merged the modifications from turboderp/alpaca_lora_4bit. It would make that method more popular, and with the right training data, maybe it actually works?

I eyeballed the loss from turboderp/alpaca_lora_4bit to see if I could get it to go lower with the bigger context while training. The model can learn not to get worse beyond 2048 context, but it doesn't learn to push the loss any lower. I tried feeding it obvious needle-haystack problems like the landmark attention paper does, but it didn't pick up on the correlation in my attempts. In theory, though, I imagine it should with enough data. Or maybe a LoRA isn't good enough for this.
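For context, this is the kind of toy needle-haystack test I mean (my own construction, not the paper's exact setup):

```python
# Toy pass-key retrieval example: bury a random key in a few thousand tokens of
# filler and ask for it back at the end. A model that can't attend past 2048
# tokens will only find it by luck.
import random

def make_passkey_example(filler_sentences=400):
    key = random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. " * filler_sentences
    insert_at = random.randint(0, len(filler))
    prompt = (
        filler[:insert_at]
        + f" The pass key is {key}. Remember it. "
        + filler[insert_at:]
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, str(key)

prompt, answer = make_passkey_example()
```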
Wouldn't that still be better than looking things up in a vector database outside the LLM (e.g., using sentence embeddings) and inserting the relevant text into the LLM context? I think this is what current search methods like superbooga do. At least in the landmark approach they use attention to figure relevance out.
Ah, that sucks.
-
The only change I made for inference in that repo was to use a preallocated cache, only I did it in a very hacky and breakable way to make it work with Transformers. ExLlama uses the same kind of preallocated cache, but more elegantly, so it supports very long contexts as long as the model knows what to do with positional embeddings after position 2048.

As for that, though, I haven't gotten to LoRA support just yet. But if the output of that LoRA repo were merged into the FP16 base model, and the combined model was then quantized again, I don't see why it wouldn't work with ExLlama. I will be adding LoRA support soon anyway; it's not really complicated. It's just that with QLoRA and GPTQ-LoRA coming out, I'm hesitating because I'm not sure what I should support first. It kind of depends on what ends up being most popular. The downside of not building on Transformers is losing that plug-and-play compatibility with every new thing that comes out, so I have to prioritize these things.

Next on the list is multi-GPU matmuls, which might give a big boost to 65B models on dual GPUs (fingers crossed). I'd like to get to 30 tokens/second at least.
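To be concrete about the merge route mentioned above, a minimal sketch of folding a LoRA into the FP16 base with peft (hypothetical paths, and assuming the adapter was trained through peft; GPTQ quantization of the merged checkpoint is a separate pass afterwards, and this isn't ExLlama code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the FP16 base model and attach the long-context LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-13b-fp16", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "path/to/long-context-lora")

# Fold the LoRA deltas into the base weight matrices, then save the merged
# FP16 model so it can be quantized to GPTQ as a normal checkpoint.
model = model.merge_and_unload()
model.save_pretrained("path/to/llama-13b-merged-fp16")
AutoTokenizer.from_pretrained("path/to/llama-13b-fp16").save_pretrained(
    "path/to/llama-13b-merged-fp16"
)
```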
Yep. But I'm not sure if that changes eventually. It might simply come in phases, i.e. it might be easy to learn to ignore the distant past because the contribution from those past tokens is so disruptive, and then take longer for the much more subtle positive contribution to start showing. It's really hard to say for certain, or maybe my understanding of ML just isn't deep enough, but it wouldn't surprise me if it would take more training than what originally went into the base model. After all, comparing 7B to 33B it's very clear that 33B is much better at long-range relationships even within the 2k context that both models work with. That could be because it has more layers and there's a convolutional aspect to how it "scans" the context. Or long-range relationships are inherently more complex and require more "circuitry" to comprehend.
Well, attention is still ultimately just a dot-product similarity search. I can see how using the same conditions for the search as are used for token-to-token attention could result in more relevant matches, but that still only makes it a more accurate vector database. I do hope to see some promising results from it, though. There might be ways to speed it up, if it's actually a good method. I would like to see comparisons of perplexity vs context length, for instance. It should show predictions getting better the more context it has for any given example, without plateauing at 2048 tokens.
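Something along these lines (a rough sketch with hypothetical paths, not ExLlama's actual evaluator) would give that perplexity-vs-context curve: score the same text while only letting the model see the last `ctx` tokens, and check whether the loss keeps dropping past 2048 or flattens out.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model", torch_dtype=torch.float16
).cuda().eval()

# One long document; in practice you'd average over many samples.
ids = tok(open("long_sample.txt").read(), return_tensors="pt").input_ids[0].cuda()

for ctx in (512, 1024, 2048, 4096, 8192):
    window = ids[-ctx:].unsqueeze(0)              # last `ctx` tokens only
    with torch.no_grad():
        loss = model(window, labels=window).loss  # HF shifts labels internally
    print(ctx, "ppl:", round(math.exp(loss.item()), 3))
```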
-
That's the main reason I use your repo right now BTW, so very much looking forward to it! All other inference methods I've seen so far suck when you start splitting.
I want to keep trying to train it. I'll try landmark attention for reference too. From what you said, I should avoid the temptation to truncate and compare at less than 2048 context. The true test is crossing the limit of the normal positional encoding.
IIRC 33B and up were also trained on 40% more data than 13B and below. Maybe there is generally poor long-range context info in normal internet text, so it needs both quantity and compute to make use of it. I would imagine cherry-picking long conversations from ShareGPT/Vicuna would guarantee some patterns beyond 2048 (but less than 4096 if using ChatGPT as a base), at least as a start. Maybe there is some good way to curate data with long-range context (other than deliberate needle-haystack). Anyway, I think I have the answer for supporting landmark attention in this repo, at least for now. Thanks.
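On the curation point above, one cheap filter (a sketch; the "conversations"/"value" field names are just the common ShareGPT JSON layout and the paths are hypothetical) is to keep only conversations that don't fit in 2048 tokens in the first place:

```python
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama-tokenizer")

data = json.load(open("sharegpt.json"))

# Keep conversations whose total token count exceeds the 2048 window, so every
# training example is forced to carry some long-range context.
long_convos = [
    conv for conv in data
    if sum(len(tok(turn["value"]).input_ids) for turn in conv["conversations"]) > 2048
]
json.dump(long_convos, open("sharegpt_long.json", "w"))
```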
-
My thinking is that any given text will contain more short-range information than long-range information. This is why, if you start training a transformer from scratch and periodically check in on what it's able to generate as it's training, you'll see that it starts by learning how to arrange tokens in pairs or triplets to form correct words. Then it starts stringing related words together but without really saying anything. Eventually you start to see full sentences, and then later on sequences of sentences that build on one another. It's a clear progression from short-range to long-range attention.

One thought I had would be to force it to do long-range attention by only giving it long-range information. You could just tweak the position IDs to spread the examples out. This would speed up training too, since you could pass 2048 tokens but they would look to the model like, say, the beginning and end of an 8192-token sequence, with nothing in the middle for it to get confused by. I'm thinking maybe you start by adding a little space randomly between sentences, and then gradually increase it. And maybe eventually you switch to full-length examples. Idk. It's definitely on my list of things to try.
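A toy version of that position-ID spreading idea (everything here, including the gap sizes and splitting on a period token, is just an illustrative assumption):

```python
import torch

def spread_position_ids(input_ids, period_token_id, max_gap=64):
    """Return position IDs with a random gap inserted after each 'sentence',
    so 2048 real tokens look like fragments of a much longer sequence."""
    pos = torch.zeros_like(input_ids)
    cur = 0
    for i, tok_id in enumerate(input_ids.tolist()):
        pos[i] = cur
        cur += 1
        if tok_id == period_token_id:            # end of a "sentence"
            cur += int(torch.randint(0, max_gap + 1, (1,)))
    return pos

# Usage sketch: pass the stretched positions alongside the normal inputs.
# position_ids = spread_position_ids(ids, period_token_id=tok.convert_tokens_to_ids("."))
# model(input_ids=ids.unsqueeze(0), position_ids=position_ids.unsqueeze(0), labels=ids.unsqueeze(0))
```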
-
That's a neat idea. I was thinking of inserting random text of varying lengths after the actual query (for instruct-tuning), or randomizing the order of all conversation items before the most recent one in a conversation chain. But if we can just offset the positional data somehow, that's way cooler. It's on-the-fly data augmentation for context training.
-
I have not tried to train landmark attention with a normal 4-bit LoRA (e.g. alpaca_4bit). It would certainly go faster, but it would train only 2 layers vs. all of them. Do all the layers need to be trained with those landmark tokens? Would it be faster because fewer layers are modified, or would it just not work? Merging adapters to all layers certainly makes inference slower. So I'm not sure what happens if you only train the 2 at 4-bit and then merge it into an FP16 model.
-
I almost always train all layers if I can. For landmark, try and see? I think the 2 default layers in alpaca_4bit are where the attention to landmark tokens happens, though I could be wrong. alpaca_4bit does a few odd things like defaulting to a short context, training only 2 layers by default, and not using a rolling context (which further discourages long-context learning imo), but you can revert all that without too much effort.
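For reference, the "2 layers vs. all of them" difference is basically just the target_modules list in a peft LoraConfig (a sketch with illustrative hyperparameters, not alpaca_4bit's exact code; the Alpaca-style default is the q/v projections):

```python
from peft import LoraConfig

# Alpaca-style default: only the query and value projections get adapters.
alpaca_style = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# "All layers": every linear in attention and the MLP gets an adapter.
all_linear = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```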
Is that a landmark-related statement? I guess I normally always merge to FP16 and requantize, and I don't notice a performance difference (non-landmark). Edit: There is a 33B finetune on the Red Pajama dataset plus landmark now, should give it a try. Though it's only 200 iters so far.
-
It just follows the Stanford Alpaca paper. The authors chose those two layers after some experimentation, I believe, determining that they were by far the most important for the results they were trying to achieve. And the point was partially to show how few extra parameters they had to train (about 10 million for 7B as I recall) to turn Llama into a pretty capable instruction-following model. The short context length I think was also suitable for the short examples in the original Alpaca question/answer set. But of course it needs to be adjusted to fit whatever dataset is being applied. And especially if it's for a model that's meant to do a back-and-forth chat with a user it really needs to use the full context length.
I guess that refers to which layers of each decoder you attach adapters to (i.e. doing the adapter computation at runtime). If you merge them into the base model's weights, there's no penalty for adapting more layers.
-
It has always bugged me that even in repos designed for ChatGPT/Vicuna data, they chunk conversations into 2048-token chunks and train on each chunk (or on the assistant responses within that chunk). That means the responses closer to the top of the chunk get less context, which is not ideal if there was indeed prior conversation to go on. That's what I meant by a "rolling" window: I'm always trying to train/compute loss only over the last 500 tokens or so of a 2048 chunk (the latest response). Of course, this really increases training time. For all I know it's placebo; I haven't done A/B tests.
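Concretely, the masking part is just setting the labels of everything but the tail to -100, which HF's loss ignores (a minimal sketch, assuming a ~500-token tail and Transformers-style training):

```python
import torch

def mask_all_but_tail(input_ids, tail_len=500, ignore_index=-100):
    """Compute loss only on the last `tail_len` tokens of the window."""
    labels = input_ids.clone()
    labels[..., :-tail_len] = ignore_index
    return labels

# Rolling-window sketch: slide the window forward so each response is scored
# with as much preceding conversation as fits in the 2048-token context.
# window = ids[start : start + 2048]
# loss = model(window.unsqueeze(0), labels=mask_all_but_tail(window).unsqueeze(0)).loss
```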
-
I don't know if that makes sense. It might. Rotary positional encodings are supposed to generalize better than the sinusoidal encodings from vanilla transformers, but they're still computed from the absolute position in the sequence. And they apparently don't generalize beyond the 2048 tokens of the base model without tuning. So I'd be worried about the model forgetting what to do with positions 0-1500 if it only ever sees 1500-2000 in finetuning. Then again... idk.
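For the curious, a quick sketch of why that is: the rotation angles in RoPE are computed directly from the absolute position index, so positions the model never saw during training get angles it never learned to handle.

```python
import torch

def rope_angles(position, dim=128, base=10000.0):
    """Per-channel-pair rotation angles for one absolute position (standard RoPE)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return position * inv_freq

print(rope_angles(10)[:4])     # position well inside the 0-2047 training range
print(rope_angles(4000)[:4])   # far beyond it: angles the base model never trained on
```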
-
Well, I'd argue that in a long conversation you are only generating the last bit of context most of the time anyway (at least for current context sizes, which don't exceed conversation lengths), so maybe it makes sense to have a specialized LoRA. You can throw in Alpaca-style LoRA/data for training the front end. Of course, it's all speculation whether it even matters.
-
Well, if there is no penalty after merging, then landmark is just very, very slow no matter what. I have just been merging the LoRA that was released into other Llama models, and it has worked for the most part. I haven't narrowed down what exactly has to be swapped over besides config.json and the scripts; for Llama I used the entire set of tokenizer/config files. The motivation hasn't been great, because the replies were short despite the model handling the extra context, and the speed was slow.
-
Well, I mean, here's the code for the modified attention function. It's dense and would take a while for me to parse, but it's clearly doing a lot of extra work on the context. From the paper I also can't see how they could get around re-evaluating the entire context for every token generated, since the position embeddings have to change depending on which blocks receive attention at any given moment. And if that weren't necessary, i.e. if the base model understood position embeddings spanning the whole sequence, then landmark attention wouldn't really be needed in the first place. So it just seems like a really slow method. They've added a Triton attention kernel that might speed things up a bit, not sure if you're using that. But I wouldn't expect miracles from that either.
-
Triton was super hyped up, but every time I've tried it, results have been worse than the normal CUDA kernels. Plus they don't support pre-Volta, and that knocks one of my systems out from the start. Makes perfect sense though: if they reprocess everything every time, not even dual 3090s can be fast with that. I think landmark attention is a good try, but it's not going to be more than a summarizer.
-
I'm curious to see if Landmark can beat standard vector retrieval like superbooga, etc., i.e., how good the joint attention-led retrieval is: basically a perplexity vs. context length plot comparing both methods, like turboderp said, for various tasks. Otherwise, what's the point? It's just a slower way. Actually, now I want to make that same plot for the ShareGPT dataset as a reference. Presumably those conversations were made with ChatGPT Turbo 4K, so we could see what potential there is in that dataset in the first place. The cost has come down, and the new 16K Turbo model is also pretty cheap: it should be possible to automatically generate some story-writing evaluation datasets to test contexts up to 16K using ChatGPT (at least, to the extent that ChatGPT is itself trained to exploit it).
-
Any thoughts on how difficult it would be to support inference on a model trained with landmark attention? Like Minotaur, Wizard or the base Llama landmark finetunes released recently, and I suppose more will come out, now that multiple repos support lora/qlora/gptq-lora training with landmark attention.
I haven’t compared results yet, but it sounds like landmark attention should be more effective with long contexts compared to the turboderp/alpaca_lora_4bit repo. Like the author, I found that that repo did “something”, and stopped generating gibberish beyond 2048 at least, but I’m not sure what the model learned. The landmark attention paper claims it can solve needle-haystack problems beyond the context length, which I couldn’t get the previous method to do.
Landmark apparently works in Oobabooga with remote code enabled.