
llama: add Grok support #6120

Closed
fakerybakery opened this issue Mar 17, 2024 · 21 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), model (Model specific)

Comments

@fakerybakery

fakerybakery commented Mar 17, 2024

Hi,
Please add support for Grok.
Thanks!

Relevant links:

fakerybakery added the enhancement (New feature or request) label Mar 17, 2024
@bachittle
Contributor

bachittle commented Mar 17, 2024

Falcon-180B was able to run, so I'm pretty sure a 314B-parameter model will also be able to run, at least if quantized. Still, I have no hardware to run either one. 😢

@dranger003
Contributor

https://huggingface.co/xai-org/grok-1

@nonetrix

A lot of people are saying the weights released are fp8 or int8 and not the full fp16 for some reason. Could this perhaps make it more challenging to add?

@netrunnereve
Collaborator

Considering how this model is too big for most of us to run at home, I wonder how well it would perform if we used fewer experts and ran it as a 30Bx4 or something. Or maybe it's possible to merge and distill the expert models into a single 30B model and do some fine-tuning. If it ends up performing like Yi 30B then we have a win here, as this is Apache licensed.

Keep in mind Mixtral 8x7B was built off modified versions of the original Mistral 7B to create the 8 experts. There's a chance the Twitter team did something similar as well.

@stduhpf
Contributor

stduhpf commented Mar 18, 2024

A lot of people are saying the weights released are fp8 or int8 and not the full fp16 for some reason. Could this perhaps make it more challenging to add?

In the README on HuggingFace they say it's int8.

@FNsi
Contributor

FNsi commented Mar 18, 2024

Would like to try this in 1.58 bit 😁

@nonetrix

Wouldn't that require a retrain? And the code isn't out either.

@Konard

Konard commented Mar 18, 2024

@fakerybakery https://huggingface.co/alpindale/grok-1 link is broken, please update the issue description.

@nonetrix

Seems like it was deleted completely since the weights were uploaded officially, I assume.

@david-jk

Has anyone had a look at the architecture yet to get an idea of how hard it would be to implement?
How different is the architecture compared to e.g. Mixtral 8x7B?

@fakerybakery
Author

@fakerybakery https://huggingface.co/alpindale/grok-1 link is broken, please update the issue description.

Fixed, thx! The author removed it since the official weights have been uploaded

@Noeda
Contributor

Noeda commented Mar 18, 2024

It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version in hopes that I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc.

I haven't seen MoE layers in any code before, so I can't judge whether this is very different from Mixtral. But I can say it's not particularly more or less complicated than most other models I've seen.

Below is a very rough sketch I made a few hours ago of the classes that would roughly correspond to what you'd write a class Module(nn.Module) for (don't trust that it's correct). I don't have a nice visualization graph. Entry to the code starts from LanguageModel, at the bottom of the Grok .py file. It uses Jax, but also a framework I'm not familiar with (Haiku). I think it's a functional style where parameters are not defined the way you do nn.Parameter() in Torch; instead the code is run once, the computation graph is built, and the parameter vectors are registered in the course of doing that. Or something along those lines.

The weights come in 8-bit precision, and QuantizedWeight8bit is used for that. So almost all of the ~300 GB is in QuantizedWeight8bit tensors.

class QuantizedWeight8bit:
    weight: parameter [int8]
    scales: parameter [bfloat16]
class KVMemory:
    k: parameter [jax.Array]
    v: parameter [jax.Array]
    step: parameter [jax.Array]
class Memory:
    layers: list [KVMemory]
class Router:
    num_experts: int
    num_selected_experts: int
class MoELayer: ...
class MHABlock: ...
class RMSNorm: ...
class RotaryEmbedding + rotate_half (EleutherAI RoPE looks like)
class MultiHeadAttention: ...
class DenseBlock: ...
class DecoderLayer: ...
class InOutEmbed: looks like weight tying here, just linear layer for first and last layer for embeddings.
class Transformer:
class LanguageModel: this is the top level class that includes everything else
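
For illustration, dequantizing one of those QuantizedWeight8bit tensors should conceptually just be the int8 weights multiplied elementwise by their scales. A tiny NumPy sketch (my assumption about shapes and broadcasting, not code from the actual checkpoint):

import numpy as np

def dequantize_8bit(weight: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # weight: int8 tensor; scales: float scales (bfloat16 in the checkpoint)
    # assumed to broadcast against it, e.g. one scale per row or per block
    return weight.astype(np.float32) * scales.astype(np.float32)

# hypothetical usage with an already-loaded QuantizedWeight8bit `qw`:
# w = dequantize_8bit(qw.weight, qw.scales)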

Vocabulary size is 128*1024 = 131072. The tokenizer looks like a BPE (it uses the sentencepiece library).
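
Assuming the tokenizer.model shipped with the weights, this is easy to sanity-check with sentencepiece (just a sketch, I haven't verified it against the actual file):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")     # tokenizer file from the Grok-1 release
print(sp.get_piece_size())     # should print 131072 if the above is right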

(I'm not working on .gguf support at the moment, but I will eventually jump into it if it seems like no one is picking it up.)

@schmorp

schmorp commented Mar 18, 2024

@Noeda if you manage to make a working gguf I'll happily make imatrix quants, if no other changes in llama.cpp are needed.

@nonetrix

Hoping we get Grok 0 weights too. They initially closed my issue about them as "not planned" but eventually reopened it, so maybe? I'm hoping Grok 0 is just a ~33B model; most people could easily load that and many could probably fine-tune it. I imagine if Grok 1 is just an MoE of Grok 0, it would be really trivial to implement as well.

@neverix

neverix commented Mar 19, 2024

It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version in hopes that I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc.

xai-org/grok-1#180 the attention looks weird

@simsim314

simsim314 commented Mar 19, 2024

@Noeda Someone has already done it all - and it seems to work with a custom GrokForCausalLM.

The following code works (didn't wait to fully download but it started):

from modeling_grok import GrokForCausalLM
model = GrokForCausalLM.from_pretrained("keyfan/grok-1-hf")

You will need modeling_grok.py and configuration_grok.py alongside it.

Now all that's left is for the llama.cpp convert.py script to support GrokForCausalLM, plus maybe some inference nuances, so the llama.cpp core may also need some adjustment.
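
For what it's worth, a useful first step for that would be dumping the HF checkpoint's tensor names and shapes so the gguf tensor mapping can be worked out. A small sketch under the same assumption as above (the modeling files from that repo are present locally); note it loads the full weights, so it needs a lot of RAM:

from modeling_grok import GrokForCausalLM

# one-off dump of tensor names/shapes to plan the convert.py mapping
model = GrokForCausalLM.from_pretrained("keyfan/grok-1-hf")
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))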

@Noeda
Contributor

Noeda commented Mar 19, 2024

Ah, I guess I should have known a model with this much interest would get a Torch port pretty quickly. I haven't continued at all since I wrote my last comment; life gets in the way. That implementation will help me at least, because I find Torch-style code a lot easier to read, since I have so much more experience with it.

Just to set expectations: when I said I would eventually port it to .gguf/llama.cpp, I meant that if I don't see anyone actively trying to work on it in a week or two, I might start doing the port work myself, provided the model doesn't seem like total crap (I've seen some comments on Reddit that even as a base model it's not really that good despite the size). With a model of this level of interest I would expect someone to start that work soon anyway. So please, if someone wants to start the work to add this support to llama.cpp, don't wait for me.

Although I might help decipher how it works if I see discussions or comments where I happen to know something useful. My experience last week with Command-R is that some contributors here are awfully fast at making stuff work :)

@simsim314

Let's see if someone gets to it... one needs to know the nuances of Grok as well as the implementation details of llama.cpp well enough to know whether any parts are missing, and to modify the convert.py script accordingly.

Regarding the model itself, it seems like a pretty good model; Matthew Berman made a recent review of it. It's probably the best open-source uncensored model so far, and if not, some fine-tuning will make it better, so we need all the quantization infrastructure anyway.

@EwoutH
Contributor

EwoutH commented Mar 20, 2024

A very simple PR was opened to allow Grok to run on CPUs:

@ggerganov
Owner

@Noeda Someone has already done it all - and it seems to work with a custom GrokForCausalLM.

The following code works (didn't wait to fully download but it started):

from modeling_grok import GrokForCausalLM
model = GrokForCausalLM.from_pretrained("keyfan/grok-1-hf")

You will need modeling_grok.py and configuration_grok.py alongside it.

Now all that's left is for the llama.cpp convert.py script to support GrokForCausalLM, plus maybe some inference nuances, so the llama.cpp core may also need some adjustment.

This is useful. One thing to keep in mind is that we should eventually make a convert script that works straight with the OG quantum data (i.e. class QuantizedWeight8bit) and converts it to Q8_0 ggml tensors, instead of dequantizing to F16.

For example, now that we have the PyTorch implementation, we can export a map of the OG tensor files to actual weights (e.g. tensor00037_000 -> blk.2.ffn_gate_inp.weight) and use that to write a script to convert the data in https://huggingface.co/xai-org/grok-1/tree/main/ckpt-0

For the graph, the Mixtral implementation seems like a good starting point
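
As a rough illustration, such a map could just be a JSON dict dumped from the PyTorch port and then consumed by the conversion script. The single entry below is the example name from above; everything else is a placeholder:

import json

# original checkpoint tensor -> gguf tensor name
# (example entry from above; the rest would be filled in by walking the
# PyTorch port's parameters)
tensor_map = {
    "tensor00037_000": "blk.2.ffn_gate_inp.weight",
    # ...
}

with open("grok-1-tensor-map.json", "w") as f:
    json.dump(tensor_map, f, indent=2)

# a converter could then walk this map, read each int8 weight and its scales
# from ckpt-0, and repack them directly as Q8_0 tensors instead of expanding to F16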

ggerganov added the good first issue (Good for newcomers) and model (Model specific) labels Mar 20, 2024
ggerganov moved this to Todo in ggml : roadmap Mar 20, 2024
@vonjackustc

It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version in hopes that I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc.

xai-org/grok-1#180 the attention looks weird

I wonder if the tanh results are better suited to 1.58-bit quantization (1, 0, -1).
