llama: add Grok support #6120
Falcon-180B was able to run. Pretty sure a 314B-parameter model will also be able to run, at least if quantized. Still, I would have no hardware to run either. 😢 |
A lot of people are saying the weights released are fp8 or int8 and not the full fp16 for some reason. Could this perhaps make it challenging to add? |
Considering how this model is too big for most of us to run at home, I wonder how well it would perform if we used fewer experts and ran it as a 30Bx4 or something. Or maybe it's possible to merge and distill the expert models into a single 30B model and do some finetuning. If it ends up performing like Yi 30B then we have a win here, as this is Apache licensed. Keep in mind Mixtral 8x7B was built off modified versions of the original Mistral 7B to create the 8 experts. There's a chance the Twitter team did something similar as well. |
In the README on Hugging Face they say it's int8. |
Would like to try this in 1.58 bit 😁 |
Wouldn't that require a retrain? And the code for that isn't out either. |
@fakerybakery https://huggingface.co/alpindale/grok-1 link is broken, please update the issue description. |
Seems like it was deleted completely since they uploaded it officially, I assume. |
Has someone had a look at the architecture yet to get an idea of how hard it is to implement? |
Fixed, thx! The author removed it since the official weights have been uploaded |
It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version, in hopes I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc. I haven't seen MoE layers in any code before, so I can't judge if that's very different from Mixtral. But I can say it's not particularly more or less complicated than most other models I've seen. Below is a very, very rough sketch I made a few hours ago of classes that would roughly correspond to what you'd want to write. The weights come in 8-bit precision, and QuantizedWeight8bit is used for that, so almost all of the 300GB is in that form:
class QuantizedWeight8bit:
    weight: parameter [int8]
    scales: parameter [bfloat16]
class KVMemory:
    k: parameter [jax.Array]
    v: parameter [jax.Array]
    step: parameter [jax.Array]
class Memory:
    layers: list [KVMemory]
class Router:
    num_experts: int
    num_selected_experts: int
class MoELayer: ...
class MHABlock: ...
class RMSNorm: ...
class RotaryEmbedding: ... (plus rotate_half; looks like EleutherAI-style RoPE)
class MultiHeadAttention: ...
class DenseBlock: ...
class DecoderLayer: ...
class InOutEmbed: looks like weight tying here, just a linear layer shared between the first (embedding) and last (output) layers.
class Transformer: ...
class LanguageModel: this is the top-level class that includes everything else.
Vocabulary size is 128*1024 = 131072. The tokenizer looks like a BPE (uses …). (I'm not working on a llama.cpp port right now, but eventually I would port this to .gguf/llama.cpp.) |
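For reference, here is a minimal sketch of how a QuantizedWeight8bit pair (int8 weights plus bfloat16 scales) could be dequantized back to a usable matrix. This is an assumption based on the field names and dtypes in the sketch above, not verified against the official code; in particular, the per-row scale layout below is made up for illustration.

```python
import torch

def dequantize_8bit(weight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Hypothetical dequantization of a QuantizedWeight8bit-style tensor.

    weight: int8 tensor, e.g. shape (out_features, in_features)
    scales: bfloat16 tensor broadcastable against `weight`
            (per-row or per-block scales, depending on the checkpoint layout).
    """
    return weight.to(torch.bfloat16) * scales.to(torch.bfloat16)

# Example with made-up shapes: one scale per output row.
w = torch.randint(-128, 127, (8, 16), dtype=torch.int8)
s = torch.rand(8, 1, dtype=torch.bfloat16)
full = dequantize_8bit(w, s)   # bfloat16, shape (8, 16)
```

A convert script could either dequantize like this to bf16/fp16 first and requantize, or map the 8-bit data more directly to a quantized GGUF type, which is the option discussed later in the thread.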
@Noeda if you manage to make a working gguf I'll happily make imatrix quants, if no other changes in llama.cpp are needed. |
Hoping we get Grok 0 weights too. They initially closed my issue asking for them as "not planned", but reopened it eventually, so maybe?? Idk. I'm hoping Grok 0 is just like a 33B model; most people could easily load that, and many could probably fine-tune it. I imagine if Grok 1 is just a MoE of Grok 0, it would be really trivial to implement it as well. |
xai-org/grok-1#180: the attention looks weird. |
@Noeda Someone has already done it all, and it seems to work with a custom GrokForCausalLM. The following code works (I didn't wait for it to fully download, but it started):
You will need … . Now we are only left with making the llama.cpp convert.py script support GrokForCausalLM, and maybe some inference nuances, so the llama.cpp core should also be somewhat adjusted. |
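The snippet referenced above isn't shown here, so as a rough guide, this is a hedged sketch of what loading a Hugging Face Grok port with a custom GrokForCausalLM typically looks like. The repo id is a placeholder, not necessarily the port referenced in this comment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "someuser/grok-1-hf"  # placeholder repo id, not from this thread

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,     # needed because GrokForCausalLM is custom model code
    torch_dtype=torch.bfloat16,
    device_map="auto",          # requires `accelerate`; spreads the weights across devices
)

inputs = tokenizer("The answer to life, the universe and everything is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```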
Ah, I guess I should have guessed a model with this much interest would get a Torch port pretty quickly. I haven't continued at all since I wrote my last comment; life gets in the way. That implementation will help me, at least because I find Torch-style code a lot easier to read, as I have so much more experience with it. Just to set expectations: when I said "eventually I would port to .gguf/llama.cpp", I meant it in the sense that if I don't see anyone actively trying to work on it in a week or two, I might start doing the port work, provided the model doesn't seem like it's total crap (I've seen some comments on Reddit that even as a base model it's not really that good despite the size). With a model of this level of interest I would expect someone to start that work soon anyway. So please, if someone wants to start the work to add this support, go ahead. Although I might help decipher stuff about how it works, if I see discussion or comments where I happen to know something that would be useful. My experience last week with Command-R is that some contributors here are awfully fast at making stuff work :) |
Let's see if someone gets to it... one needs to know the nuances of Grok as well as the implementation details of llama.cpp well enough to know whether some parts are missing or not, and to modify the convert.py script accordingly. Regarding the model itself, it seems like a pretty good model; Matthew Berman made a recent review of it. It's probably the best open-source uncensored model so far, and if not, some fine-tuning will make it better, so we need all the quantization infrastructure anyway. |
A very simple PR was opened to allow Grok to run on CPUs:
This is useful. One thing to keep in mind is that we should eventually make a convert script that works straight with the OG quantum data (i.e. …). For example, now that we have the PyTorch implementation, we can export a map of the OG tensor files to actual weights (e.g. …). For the graph, the Mixtral implementation seems like a good starting point. |
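As a rough illustration of the kind of tensor-name map described here: walk the PyTorch port's state_dict and record, for each tensor, a candidate GGUF-style name. The renaming rules and file names below are illustrative guesses, not llama.cpp's actual naming scheme for Grok.

```python
import json
import re
import torch

def to_gguf_name(name: str) -> str:
    """Map a PyTorch-port tensor name to a GGUF-style name (illustrative only)."""
    name = re.sub(r"^model\.layers\.(\d+)\.", r"blk.\1.", name)
    name = name.replace("self_attn.q_proj", "attn_q")
    name = name.replace("self_attn.k_proj", "attn_k")
    name = name.replace("self_attn.v_proj", "attn_v")
    return name

def dump_name_map(state_dict, path: str) -> None:
    mapping = {src: to_gguf_name(src) for src in state_dict}
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2)

# Usage against a hypothetical local shard of the PyTorch port:
# sd = torch.load("grok-1-pytorch/pytorch_model-00001-of-00019.bin", map_location="cpu")
# dump_name_map(sd, "grok_name_map.json")
```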
I wonder if tanh results are better for 1.58-bit quant (1, 0, -1). |
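For context on what 1.58-bit (1, 0, -1) means here, below is a minimal sketch of BitNet-b1.58-style absmean ternary quantization of a weight matrix. This is just the standard recipe from the BitNet b1.58 paper, not anything Grok-specific, and whether tanh-bounded values quantize better this way is exactly the open question in the comment above.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization (BitNet b1.58 style): values in {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)    # per-tensor absmean scale
    q = (w / scale).round().clamp_(-1, 1)    # snap each weight to {-1, 0, +1}
    return q, scale                          # dequantize as q * scale

w = torch.randn(4, 8)
q, s = ternary_quantize(w)
print(q.unique())                 # subset of {-1, 0, 1}
print((q * s - w).abs().mean())   # mean quantization error
```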
Hi,
Please add support for Grok.
Thanks!
Relevant links: