
llama: add Grok support #6120

Closed
fakerybakery opened this issue Mar 17, 2024 · 21 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), model (Model specific)

Comments

@fakerybakery

fakerybakery commented Mar 17, 2024

Hi,
Please add support for Grok.
Thanks!

Relevant links:

fakerybakery added the enhancement (New feature or request) label Mar 17, 2024
@bachittle
Contributor

bachittle commented Mar 17, 2024

Falcon-180B was able to run, so I'm pretty sure a 314B-parameter model will also be able to run, at least if quantized. Still, I have no hardware to run either one. 😢

@dranger003
Contributor

https://huggingface.co/xai-org/grok-1

@nonetrix

A lot of people are saying the weights released are fp8 or int8 and not the full fp16 for some reason. Could this perhaps make it more challenging to add?

@netrunnereve
Collaborator

Considering how this model is too big for most of us to run at home, I wonder how well it would perform if we used fewer experts and ran it as a 30Bx4 or something. Or maybe it's possible to merge and distill the expert models into a single 30B model and do some fine-tuning. If it ends up performing like Yi 30B then we have a win here, as this is Apache licensed.

Keep in mind Mixtral 8x7B was built off modified versions of the original Mistral 7B to create the 8 experts. There's a chance the Twitter team did something similar as well.

@stduhpf
Contributor

stduhpf commented Mar 18, 2024

A lot of people are saying the weights released are fp8 or int8 and not the full fp16 for some reason. Could this perhaps make it more challenging to add?

In the README on HuggingFace they say it's int8.

@FNsi
Contributor

FNsi commented Mar 18, 2024

Would like to try this in 1.58 bit 😁

@nonetrix

Wouldn't that require a retrain? And the code isn't out either.

@Konard

Konard commented Mar 18, 2024

@fakerybakery https://huggingface.co/alpindale/grok-1 link is broken, please update the issue description.

@nonetrix

Seems like it was deleted completely since the weights were uploaded officially, I assume.

@david-jk

Has anyone had a look at the architecture yet to get an idea of how hard it would be to implement?
How different is the architecture compared to e.g. Mixtral 8x7B?

@fakerybakery
Author

@fakerybakery https://huggingface.co/alpindale/grok-1 link is broken, please update the issue description.

Fixed, thx! The author removed it since the official weights have been uploaded

@Noeda
Contributor

Noeda commented Mar 18, 2024

It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version in hopes that I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc.

I haven't seen MoE layers in any code before, so I can't judge whether this is very different from Mixtral. But I can say it's not particularly more or less complicated than most other models I've seen.

Below is a very rough sketch I made a few hours ago of the classes that would roughly correspond to what you'd write a class Module(nn.Module) for (don't trust that it's correct). I don't have a nice visualization graph. Entry to the code starts from LanguageModel, at the bottom of the Grok .py file. It uses Jax, but also a framework I'm not familiar with (Haiku). I think it's a functional style where parameters are not defined the way you do nn.Parameter() in Torch; instead the code is run once, the computation graph is built, and the parameter vectors are registered in the course of doing that. Or something along those lines.

The weights come in 8-bit precision, and QuantizedWeight8bit is used for that. So almost all of the ~300 GB is in QuantizedWeight8bit tensors.

class QuantizedWeight8bit:
    weight: parameter [int8]
    scales: parameter [bfloat16]
class KVMemory:
    k: parameter [jax.Array]
    v: parameter [jax.Array]
    step: parameter [jax.Array]
class Memory:
    layers: list [KVMemory]
class Router:
    num_experts: int
    num_selected_experts: int
class MoELayer: ...
class MHABlock: ...
class RMSNorm: ...
class RotaryEmbedding + rotate_half (EleutherAI RoPE looks like)
class MultiHeadAttention: ...
class DenseBlock: ...
class DecoderLayer: ...
class InOutEmbed: looks like weight tying here, just linear layer for first and last layer for embeddings.
class Transformer:
class LanguageModel: this is the top level class that includes everything else
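
For illustration, dequantizing one of those QuantizedWeight8bit tensors should conceptually just be the int8 weights multiplied elementwise by their scales. A tiny NumPy sketch (my assumption about shapes and broadcasting, not code from the actual checkpoint):

import numpy as np

def dequantize_8bit(weight: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # weight: int8 tensor; scales: float scales (bfloat16 in the checkpoint)
    # assumed to broadcast against it, e.g. one scale per row or per block
    return weight.astype(np.float32) * scales.astype(np.float32)

# hypothetical usage with an already-loaded QuantizedWeight8bit `qw`:
# w = dequantize_8bit(qw.weight, qw.scales)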

Vocabulary size is 128*1024 = 131072. The tokenizer looks like a BPE (it uses the sentencepiece library).
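
Assuming the tokenizer.model shipped with the weights, this is easy to sanity-check with sentencepiece (just a sketch, I haven't verified it against the actual file):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")     # tokenizer file from the Grok-1 release
print(sp.get_piece_size())     # should print 131072 if the above is right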

(I'm not working on .gguf support at the moment, but I will eventually jump into it if it seems like no one is picking it up.)

@schmorp

schmorp commented Mar 18, 2024

@Noeda if you manage to make a working gguf I'll happily make imatrix quants, if no other changes in llama.cpp are needed.

@nonetrix

Hoping we get Grok 0 weights too. They initially closed my issue about them as "not planned" but eventually reopened it, so maybe? I'm hoping Grok 0 is just a ~33B model; most people could easily load that and many could probably fine-tune it. I imagine if Grok 1 is just an MoE of Grok 0, it would be really trivial to implement as well.

@neverix

neverix commented Mar 19, 2024

It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version in hopes that I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc.

xai-org/grok-1#180 the attention looks weird

@simsim314

simsim314 commented Mar 19, 2024

@Noeda Someone has already done it all - and it seems to work with a custom GrokForCausalLM.

The following code works (didn't wait to fully download but it started):

from modeling_grok import GrokForCausalLM
model = GrokForCausalLM.from_pretrained("keyfan/grok-1-hf")

You will need modeling_grok.py and configuration_grok.py alongside it.

Now all that's left is for the llama.cpp convert.py script to support GrokForCausalLM, plus maybe some inference nuances, so the llama.cpp core may also need some adjustment.
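
For what it's worth, a useful first step for that would be dumping the HF checkpoint's tensor names and shapes so the gguf tensor mapping can be worked out. A small sketch under the same assumption as above (the modeling files from that repo are present locally); note it loads the full weights, so it needs a lot of RAM:

from modeling_grok import GrokForCausalLM

# one-off dump of tensor names/shapes to plan the convert.py mapping
model = GrokForCausalLM.from_pretrained("keyfan/grok-1-hf")
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))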

@Noeda
Contributor

Noeda commented Mar 19, 2024

Ah, I guess I should have known a model with this much interest would get a Torch port pretty quickly. I haven't continued at all since I wrote my last comment; life gets in the way. That implementation will help me at least, because I find Torch-style code a lot easier to read, since I have so much more experience with it.

Just to set expectations: when I said I would eventually port it to .gguf/llama.cpp, I meant that if I don't see anyone actively trying to work on it in a week or two, I might start doing the port work myself, provided the model doesn't seem like total crap (I've seen some comments on Reddit that even as a base model it's not really that good despite the size). With a model of this level of interest I would expect someone to start that work soon anyway. So please, if someone wants to start the work to add this support to llama.cpp, don't wait for me.

Although I might help decipher how it works if I see discussions or comments where I happen to know something useful. My experience last week with Command-R is that some contributors here are awfully fast at making stuff work :)

@simsim314

Let's see if someone gets to it... one needs to know the nuances of Grok as well as the implementation details of llama.cpp well enough to know whether any parts are missing, and to modify the convert.py script accordingly.

Regarding the model itself, it seems like a pretty good model; Matthew Berman made a recent review of it. It's probably the best open-source uncensored model so far, and if not, some fine-tuning will make it better, so we need all the quantization infrastructure anyway.

@EwoutH
Contributor

EwoutH commented Mar 20, 2024

A very simple PR was opened to allow Grok to run on CPUs:

@ggerganov
Owner

@Noeda Someone has already done it all - and it seems to work with a custom GrokForCausalLM.

The following code works (didn't wait to fully download but it started):

from modeling_grok import GrokForCausalLM
model = GrokForCausalLM.from_pretrained("keyfan/grok-1-hf")

You will need modeling_grok.py and configuration_grok.py alongside it.

Now all that's left is for the llama.cpp convert.py script to support GrokForCausalLM, plus maybe some inference nuances, so the llama.cpp core may also need some adjustment.

This is useful. One thing to keep in mind is that we should eventually make a convert script that works straight with the OG quantum data (i.e. class QuantizedWeight8bit) and converts it to Q8_0 ggml tensors, instead of dequantizing to F16.

For example, now that we have the PyTorch implementation, we can export a map of the OG tensor files to actual weights (e.g. tensor00037_000 -> blk.2.ffn_gate_inp.weight) and use that to write a script to convert the data in https://huggingface.co/xai-org/grok-1/tree/main/ckpt-0

For the graph, the Mixtral implementation seems like a good starting point
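
As a rough illustration, such a map could just be a JSON dict dumped from the PyTorch port and then consumed by the conversion script. The single entry below is the example name from above; everything else is a placeholder:

import json

# original checkpoint tensor -> gguf tensor name
# (example entry from above; the rest would be filled in by walking the
# PyTorch port's parameters)
tensor_map = {
    "tensor00037_000": "blk.2.ffn_gate_inp.weight",
    # ...
}

with open("grok-1-tensor-map.json", "w") as f:
    json.dump(tensor_map, f, indent=2)

# a converter could then walk this map, read each int8 weight and its scales
# from ckpt-0, and repack them directly as Q8_0 tensors instead of expanding to F16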

ggerganov added the good first issue (Good for newcomers) and model (Model specific) labels Mar 20, 2024
ggerganov moved this to Todo in ggml : roadmap Mar 20, 2024
@vonjackustc

It's not particularly complicated. I started porting the Jax version to a simplified PyTorch version in hopes that I can get it to run at least a little faster on a big Mac Studio and also make it more readable. Maybe about halfway done. All the components look like they're made of standard stuff: vanilla attention, some MoE layers, the same RoPE we all know and love, etc.

xai-org/grok-1#180 the attention looks weird

I wonder if the tanh results are better suited to 1.58-bit quantization (1, 0, -1).
