Add grok-1 support #6204
Good job (I noticed the fork earlier 😉 )
I think it is important to have such script because using the F16 weights to re-quantize to But this is already a great start and we can add the script in another PR |
Hm, we definitely need the JAX -> GGUF script - converting with
|
Setting But yes I'm already working on a JAX to GGUF script. |
./main -m ./models/grok-1/ggml-model-iq3_s.gguf -p "The answer to life the universe and everything is of course" -s 1 -n 64 -ngl 99
|
With
(Don't have enough RAM for Looking at other inference runs (on the official GitHub xai-org/grok-1) the quality seems comparable imo. I can't find any issues with the implementation, but I can't really rule them out either. |
Something is definitely wrong because the perplexity is through the roof:
./perplexity -m ./models/grok-1/ggml-model-iq3_s.gguf -f build/wikitext-2-raw/wiki.test.raw -ngl 99
For comparison, Mixtral 8x7B
I'll find some time in the next few days to investigate, since I can run the model much more efficiently. But we should fix this before merging |
The rope type was incorrect. It works now:
grok-1.mp4 |
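For readers unfamiliar with why a wrong rope type wrecks perplexity, here is a minimal sketch of the two common rotary-embedding pairing conventions. It is an illustration only, not llama.cpp's implementation: the function name `rope_rotate` and its `neox` flag are hypothetical, and I am assuming that the `rope type = 2` printed in the log further down corresponds to the NeoX-style pairing. Both variants rotate by the same angles, but they pair different vector elements, so weights trained with one convention produce garbage with the other.

```python
import math

def rope_rotate(vec, pos, base=10000.0, neox=True):
    """Apply rotary position embedding to one head vector (list of floats).

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    d = len(vec)
    half = d // 2
    out = list(vec)
    for i in range(half):
        # Rotation angle for the i-th frequency at this position.
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        # NeoX style pairs element i with element i + d/2;
        # the "normal" style pairs adjacent elements 2i and 2i + 1.
        a, b = (i, i + half) if neox else (2 * i, 2 * i + 1)
        out[a] = vec[a] * c - vec[b] * s
        out[b] = vec[a] * s + vec[b] * c
    return out

v = [1.0, 0.0, 0.0, 0.0]
neox_result = rope_rotate(v, pos=1, neox=True)
norm_result = rope_rotate(v, pos=1, neox=False)
```

With this input the NeoX variant moves energy into element 2, while the adjacent-pair variant moves it into element 1, which is why the two are not interchangeable.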
Can you share your hardware / ram usage and other specs? @ggerganov and @arki05 |
(answering for Georgi based on the bits here and there) |
I'm working on a Threadripper 3955WX with 256GB RAM. As long as I use a version that fits into 256GB, I'm getting a reasonable 0.5 tokens per second.
Log
```
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = grok
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 131072
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 6144
llm_load_print_meta: n_head = 48
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 32768
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 314B
llm_load_print_meta: model ftype = IQ3_XS - 3.3 bpw
llm_load_print_meta: model params = 316.49 B
llm_load_print_meta: model size = 120.73 GiB (3.28 BPW)
llm_load_print_meta: general.name = Grok
llm_load_print_meta: BOS token = 1 '[BOS]'
llm_load_print_meta: EOS token = 2 '[EOS]'
llm_load_print_meta: UNK token = 0 '[PAD]'
llm_load_print_meta: PAD token = 0 '[PAD]'
llm_load_print_meta: LF token = 79 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.81 MiB
llm_load_tensors: CPU buffer size = 16716.66 MiB
llm_load_tensors: CPU buffer size = 14592.75 MiB
llm_load_tensors: CPU buffer size = 14484.75 MiB
llm_load_tensors: CPU buffer size = 14901.35 MiB
llm_load_tensors: CPU buffer size = 14714.18 MiB
llm_load_tensors: CPU buffer size = 14493.75 MiB
llm_load_tensors: CPU buffer size = 14484.75 MiB
llm_load_tensors: CPU buffer size = 15250.88 MiB
llm_load_tensors: CPU buffer size = 3990.96 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 128.00 MiB
llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_new_context_with_model: CPU output buffer size = 256.00 MiB
llama_new_context_with_model: CPU compute buffer size = 356.03 MiB
llama_new_context_with_model: graph nodes = 3782
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |

User: In the context of LLMs, what is a sparse tensor?\nAssistant: It's a type of tensor that stores data in a non-contiguous way, allowing for efficient storage and computation of large data sets.\n\nUser: In the context of LLMs, what is a dense tensor?\nAssistant: It's a type of tensor that stores data in a contiguous way, allowing for efficient computation of small data sets.\n\nUser: In the context of LLMs, what is a data pipeline?\nAssistant: It's a series of steps that are used to process and analyze large amounts of data, including data cleaning, feature extraction, and model training.\n\nUser
```
|
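Several values in the log above follow from others, which makes for a quick sanity check. The sketch below recomputes the grouped-query attention figures and the KV cache size from the printed hyperparameters. The helper name `kv_cache_mib` is hypothetical, and the formula assumes one f16 K (or V) entry of width `n_embd_k_gqa` per layer per context position, which matches the sizes the log reports.

```python
# Values copied from the llm_load_print_meta / llama_new_context lines above.
n_layer, n_head, n_head_kv = 64, 48, 8
n_embd_head_k = 128
n_ctx = 512
BYTES_PER_F16 = 2

# Query heads sharing each KV head, and the total K width per token per layer.
n_gqa = n_head // n_head_kv                 # 6, as printed in the log
n_embd_k_gqa = n_embd_head_k * n_head_kv    # 1024, as printed in the log

def kv_cache_mib(n_layer, n_ctx, width, bytes_per_elem=BYTES_PER_F16):
    """Size in MiB of the K (or V) cache; hypothetical helper for illustration."""
    return n_layer * n_ctx * width * bytes_per_elem / 2**20

k_mib = kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa)   # 64.0, matches "K (f16): 64.00 MiB"
total_mib = 2 * k_mib                                # 128.0, matches "KV self size"
print(n_gqa, n_embd_k_gqa, k_mib, total_mib)
```

Because the cache grows linearly with `n_ctx`, running at the full training context of 8192 would need 16 times as much, about 2 GiB under the same assumptions.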
It's a great job. Tomorrow I will try the q2 quantization of grok-1 on my M3 Max. |
FYI: for anyone testing using the quants in Arki05/Grok-1-GGUF: Edit: Merged now, no need for additional branches. |
@arki05 Thank you once again - great work! |
server \
--hf-repo Arki05/Grok-1-GGUF \
--hf-file grok-1-IQ3_XS-split-00001-of-00009.gguf \
--model models/grok-1-IQ3_XS-split-00001-of-00009.gguf \
-ngl 999 |
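The command above points at the first of nine split files. As a small aid for anyone scripting downloads, the sketch below generates the full list of shard names. It assumes the remaining shards follow the same `-split-NNNNN-of-NNNNN.gguf` pattern as the one file name shown in the command; `split_names` is a hypothetical helper, not part of llama.cpp, so check the names against the actual repository listing before relying on it.

```python
def split_names(prefix, n_splits):
    """Return shard file names, assuming the naming pattern seen in the command above."""
    return [
        f"{prefix}-split-{i:05d}-of-{n_splits:05d}.gguf"
        for i in range(1, n_splits + 1)
    ]

names = split_names("grok-1-IQ3_XS", 9)
for name in names:
    print(name)
```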
Here is the full ppl for ./perplexity -m ./models/grok-1/ggml-model-iq3_s.gguf -f build/wikitext-2-raw/wiki.test.raw -ngl 99
Hellaswag@400 is ./perplexity --hellaswag -f build/hellaswag_val_full.txt -m models/grok-1/ggml-model-iq3_s.gguf --hellaswag-tasks 400 We will later compare this with the properly converted models that do not go through the F16 dequantization |
Any ideas how to compute an If somebody has a 700GB RAM machine, the following should do the job: ./imatrix -m ./models/grok-1/ggml-model-fp16.gguf -f ./wikitext-2-raw/wiki.train.raw -ngl 0 -b 512 Though it might take a while :) |
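The 700GB figure can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it uses the parameter count and model size printed in the log earlier in this thread (316.49 B parameters, 120.73 GiB for IQ3_XS), treats every weight as having the same width, and ignores KV cache and compute buffers. The function names are hypothetical.

```python
GIB = 2**30

def model_size_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB for a uniform bits-per-weight."""
    return n_params * bits_per_weight / 8 / GIB

def bits_per_weight(size_gib, n_params):
    """Average bits per weight implied by a file size."""
    return size_gib * GIB * 8 / n_params

n_params = 316.49e9                      # "model params = 316.49 B" from the log

f16_gib = model_size_gib(n_params, 16)   # roughly 590 GiB for F16 weights
iq3_bpw = bits_per_weight(120.73, n_params)  # roughly 3.28, as the log reports
print(round(f16_gib, 1), round(iq3_bpw, 2))
```

Roughly 590 GiB of weights plus working buffers is in line with the suggestion that a 700GB machine is needed for the F16 model.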
@ggerganov if you give me guidelines I can do anything for you (1.2TB ram, 4x E7-8890 v4, ubuntu 2204) |
@RichardErkhov Here are the commands:
git clone https://huggingface.co/keyfan/grok-1-hf
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert-hf-to-gguf.py ../grok-1-hf/ --outfile models/grok-1-f16.gguf --outtype f16
./scripts/get-wikitext-2.sh
unzip wikitext-2-raw-v1.zip
make -j imatrix
./imatrix -m ./models/grok-1-f16.gguf -f ./wikitext-2-raw/wiki.train.raw -b 512
The last command will run for a few days. When done, upload the generated |
ok, see you in a few days haha. I first need to download the hf model, my internet is just 80 mbps. Hopefully no electricity shutdown |
I gladly trade my 500 mbps line for your 1.2TB RAM :) |
In the meantime, I generated an https://huggingface.co/ggml-org/imatrix/blob/main/grok-1-iq3_s.imatrix It seems to help - here is a summary of zero-shot Hellaswag scores at 400 tasks for Grok-1 and Mixtral:
PPL:
|
@ggerganov It's |
No worries - seems it would need a lot of time, so feel free to stop it. Moreover the |
@ggerganov I can keep it running and just publish it when it finishes. If you need anything else just text me, I'm always open to help =) |
update from "that crazy guy with 1.2TB of ram that will run some random stuff for fun" |
ah, it decided to disappear, how cool xD nevermind, I guess Ubuntu is having some fun with the long-running task |
Hi there. I made an implementation in foldl/chatllm.cpp@912bacc . I don't have enough compute resources, so maybe we can only export a subset of experts. A test with the first 4 experts shows some meaningful but not expressive results, while with the first 2 experts it is worse. Would someone like to run a test with all 8 experts? Doc. |
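To make the "subset of experts" idea concrete, here is a generic sketch of top-k expert routing in a mixture-of-experts layer, with an optional restriction to a subset of experts. This is not taken from chatllm.cpp or from Grok-1's reference code: the softmax-then-renormalize scheme, the function name `route_top_k`, and the way the subset is applied are all assumptions for illustration. It shows why dropping experts changes which experts a token is sent to.

```python
import math

def route_top_k(router_logits, k=2, available=None):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    `available` optionally restricts routing to a list of exported expert indices.
    Hypothetical helper; real implementations may normalize differently.
    """
    allowed = list(range(len(router_logits))) if available is None else list(available)
    # Numerically stable softmax over the allowed experts only.
    m = max(router_logits[i] for i in allowed)
    exp = {i: math.exp(router_logits[i] - m) for i in allowed}
    total = sum(exp.values())
    probs = {i: e / total for i, e in exp.items()}
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    # Renormalize so the selected weights sum to 1 (an assumption of this sketch).
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]   # made-up router scores, 8 experts
full = route_top_k(logits, k=2)                        # routes to experts 4 and 1
subset = route_top_k(logits, k=2, available=[0, 1, 2, 3])  # expert 4 is unavailable
```

In this made-up example the token's preferred expert (index 4) is missing from the 4-expert subset, so it falls back to a lower-scoring one, which is one plausible reason output quality drops as fewer experts are exported.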
But the model is not instructed ? How can you chat ? |
@phymbert ChatLLM.cpp can work in completion mode. |
@foldl give me step by step what to execute and ask and I can run it for you |
@RichardErkhov Thank you! Here is the step by step: https://github.com/foldl/chatllm.cpp/blob/master/docs/grok.md
|
ok, if everything works you will get results tomorrow, as I need to download the repo. Give me what to ask the model |
I have no idea. Maybe ask for the answer for everything? |
@foldl idk, we will see haha. It's downloading at 6.5 MB/s and the download is 300 GB, which is about 13 hours haha, so I guess tomorrow morning I will convert and run it
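The 13-hour estimate checks out if the rate is read as megabytes per second and decimal units are used (both are assumptions here; at 6.5 megabits per second it would take about eight times longer). A tiny sketch with a hypothetical helper:

```python
def download_hours(size_gb, rate_mb_per_s):
    """Hours to download size_gb gigabytes at rate_mb_per_s megabytes/second (decimal units)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600

hours = download_hours(300, 6.5)
print(f"{hours:.1f} hours")
```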
|
@RichardErkhov THANK YOU! I am satisfied with the output, and no more experiments are needed. Later, I will a full layer and compare the output against JAX implementation, as a double check. The ggml community is awesome. |
Lol, this is a very funny typo xD @foldl |
@RichardErkhov #@@#@!@ a funny typo. Why the AI-powered Edge had not corrected it for me? Lol. |
@foldl want anything else to run? I can help with projects. You can contact me in discord if you want. ganza2309 |
@RichardErkhov No, thanks. It's time to free up your disk space, :). |
Yeah, electricity went down and it cleaned itself lol |
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor
---------
Co-authored-by: Georgi Gerganov <[email protected]>
This pull request adds grok-1 support to llama.cpp (#6120).
I've added a separate `MODEL_ARCH_GROK` so as not to clutter the LLAMA arch too much.
The `convert-hf-to-gguf.py` script can now convert from keyfan/grok-1-hf to GGUF. I've started uploading quants in split-GGUF format to Arki05/Grok-1-GGUF; this might take a while due to the size.
For now, the graph includes a few hardcoded values like `attn_output_multiplyer` that were included in the original implementation. Maybe we should move those to a separate parameter, but I'm not sure what the policy / guidelines on those are.
Would a script to convert from the base JAX weights to GGUF be helpful? If so, I can work on that next.
PS: Please be gentle, it's my first pull request here.