Add Command-R Model #6033
Conversation
I can successfully convert the model to GGUF format with these changes. |
You are so cool 👍 |
It seems that Cohere uses a different way of doing rotate_half in their codebase, unlike other llama-based models. Would this be okay with llama.cpp? |
I noticed some potentially useful comments about the model here: Lightning-AI/litgpt#1089 |
I think they've missed the rotate_half part - it was different in my fine-tuning test, though I'm not sure whether it matters for inference.
|
If I'm reading the modeling code correctly, this is the same RoPE as usual: Lines 12696 to 12702 in f30ea47
So I don't think any changes are needed. Just use |
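For anyone comparing the two formulations, here is a minimal sketch (not from this PR; names are illustrative) of the rotate_half-style RoPE used in HF modeling code, which pairs dimension i with i + head_dim/2 rather than interleaving (2i, 2i+1):

```python
# Illustrative sketch of HF-style "rotate_half" RoPE (not the PR code).
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the head dimension into two halves and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (..., head_dim); cos/sin are precomputed per-position tables broadcastable to x.
    return (x * cos) + (rotate_half(x) * sin)
```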
Any ideas on where to add logit_scale? Comparing the llama and the cohere models: llama_cohere_diff.txt
@@ -1212,6 +1161,7 @@
logits = torch.cat(logits, dim=-1)
else:
logits = self.lm_head(hidden_states)
+ logits = logits * self.logit_scale
logits = logits.float() |
Probably this would work:
diff --git a/llama.cpp b/llama.cpp
index 38e7036a..a911cdff 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5857,6 +5857,12 @@ struct llm_build_context {
// lm_head
cur = ggml_mul_mat(ctx0, model.output, cur);
+
+ if (model.logits_scale != 1.0f) {
+ cur = ggml_scale(ctx0, cur, model.logits_scale);
+ }
+
cb(cur, "result_output", -1);
ggml_build_forward_expand(gf, cur); |
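To illustrate why the scale matters even though it doesn't change the argmax, here is a small hedged example (the values below are made up; the real scale comes from the model's config.json):

```python
# Illustration only: logit_scale acts like a fixed temperature applied before
# softmax, so omitting it changes sampling probabilities even if greedy
# decoding still picks the same top token.
import torch

logits = torch.tensor([4.0, 2.0, 1.0])   # made-up logits
logit_scale = 0.0625                     # example value, not necessarily the model's

p_raw = torch.softmax(logits, dim=-1)
p_scaled = torch.softmax(logits * logit_scale, dim=-1)
print(p_raw)      # sharply peaked
print(p_scaled)   # much flatter distribution
```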
Another change is in the transformer layer.
graph TD;
Input-->InputLayerNorm;
InputLayerNorm-->Attention;
Attention-->Plus;
Input-->Plus;
Plus-->PostAttnLayerNorm;
PostAttnLayerNorm-->MLP;
MLP-->Plus2;
Plus-->Plus2
Plus2-->Output;
In Cohere transformer layers there is no post-attention LayerNorm:
graph TD;
Input-->InputLayerNorm;
Input-->Plus;
InputLayerNorm-->Attention;
Attention-->Plus;
InputLayerNorm-->MLP;
MLP-->Plus
Plus-->Output;
In the Python code, comparing the Cohere model to the Llama model:
def forward(
self,
@@ -737,7 +691,7 @@
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
- hidden_states, self_attn_weights, present_key_value = self.self_attn(
+ hidden_states_attention, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
@@ -747,16 +701,12 @@
cache_position=cache_position,
**kwargs,
)
- hidden_states = residual + hidden_states
# Fully Connected
- residual = hidden_states
- hidden_states = self.post_attention_layernorm(hidden_states)
+ hidden_states_mlp = self.mlp(hidden_states)
- hidden_states = self.mlp(hidden_states)
# Add everything together
- hidden_states = residual + hidden_states
+ hidden_states = residual + hidden_states_attention + hidden_states_mlp
outputs = (hidden_states,) |
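Putting the diagram and the diff together, here is a minimal sketch of the parallel block (not the actual HF module; details such as bias-free LayerNorm and attention caching are omitted):

```python
import torch.nn as nn

class ParallelDecoderBlockSketch(nn.Module):
    """Sketch of the Command-R style layer: one shared LayerNorm feeding both
    branches, no post-attention LayerNorm, a single residual add at the end."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, hidden_size: int):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.attn = attn
        self.mlp = mlp

    def forward(self, hidden_states, **attn_kwargs):
        residual = hidden_states
        normed = self.input_layernorm(hidden_states)
        attn_out = self.attn(normed, **attn_kwargs)   # attention branch
        mlp_out = self.mlp(normed)                    # FFN branch, same normed input
        return residual + attn_out + mlp_out          # add everything together
```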
Okay I made the changes mentioned above. Thanks for the help Georgi. |
Howdy. I was working on this same thing to get this model working on llama.cpp, but my results were incoherent when I got it to run for the first time, whereas when I tested the code in this PR's branch I got coherent text, so this is definitely more correct than my hack. But I also noticed that the Q8_0 quant does not work at all. Other than that, I can confirm that the code so far, as written by @acanis, produces coherent text that seems indistinguishable from the HF version (although I didn't test at the exact logit-probability level at this time), so in terms of getting the model computed correctly mathematically it cannot be very wrong. I can come back to this tomorrow or later this week if no one else chimes in to confirm correctness. Quick edit: I noticed @acanis pushed new code about an hour before I wrote this comment and again right after; I was testing the commits before those. |
Thanks for testing, Noeda. Any help is appreciated. I just tested llama.cpp on the F16 model (without quantization) using the suggested prompt. The prompt output matches pretty closely to the output from the 8-bit reference quantized Python model. Full log (CPU) is below:
Another log this time running with cuBLAS (on the same machine with an A100 GPU) with low temperature:
I also noticed that the quantize failed for Q8_0 with this error (didn't have time to investigate):
|
@acanis Thanks for your efforts! I'll pull your code tomorrow, do some testing, and compare with the HF implementation to see if there are any inaccuracies. If there's coherent text, then I don't expect major divergences. Maybe I'll also try to check why I'm having trouble with Q8. |
This is correct ✅ |
The commits from chatllm.cpp may help. |
lgtm, successfully created a quant & ran it :) |
./quantize for Q8_0 doesn't work. |
@acanis How can llama.cpp support the 128K context for this model? Any ideas? |
oh, you're right.. that's weird |
@sweetcard I was planning to investigate that next after fixing the quantization. The current context is 8192 tokens. |
Thank you for your cool work👍 |
If the model training context is 128k, make sure to set the context length in the GGUF metadata to 128k. The loading log should report 128k:
With |
There's a discussion from the authors about the context size: huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/12. I'm still not clear on the real context size: 8k or 128k? |
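One hedged way to check what the downloaded checkpoint actually advertises before converting (the field names are an assumption; the config layout may differ between repo revisions):

```python
# Hedged sanity check: print the advertised context length from the local
# HF checkout before running convert-hf-to-gguf.py.
import json
from pathlib import Path

cfg = json.loads(Path("c4ai-command-r-v01/config.json").read_text())
print(cfg.get("model_max_length"), cfg.get("max_position_embeddings"))
# If model_max_length is missing or small, pull the latest repo revision
# before converting, otherwise the GGUF metadata may report the short context.
```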
@saurabhdash The Metal backend (this is what @sweetcard appears to be using) already computes the rope in F32 precision. For the moment I wouldn't pay much attention to the reported "degradation" because there are no steps provided by @sweetcard to reproduce, so many other things could have gone wrong in their tests. Moreover, from the screenshot earlier with the log it looks to be an OOM or integer overflow problem in the Metal implementation, so it's likely not related to this PR specifically. |
Running the F16 unquantized Command-R model on HellaSwag for 400 random samples I get: 86.25%. Full log: https://gist.github.com/acanis/719c7474ff4439f59cddc3826a0ac34f To reproduce, run the following commands:
Edit: updated perplexity flags to offload to the GPU, thanks @slaren |
@acanis Looking at your log, I noticed that you didn't offload any layers to the GPU. I am just wondering, is there any reason for that? You should be able to get much better performance by adding -ngl to offload layers. |
Thanks so much @slaren! My mistake, I hadn't realized that -ngl makes this much faster! :) That explains the low GPU utilization. |
@saurabhdash @ggerganov To reproduce, use the commands above but add -ngl to offload layers to the GPU. |
I don't have anything else from my side. I am fairly confident everything is working. The only divergence I'm aware of is that tokenization is not entirely equal, and that slightly alters the results. I just meddled with the HF implementation again; if we let HF run with its own tokenization, totally vanilla: HF vanilla logits
HF logits if used with llama.cpp tokenization
It's not a big difference, but it does alter the top-logits ordering a little bit. The only difference I see with my eyeballs in the sample text tokenizations is that tokens 206 ('\n') and 2126 ('\n\n') are handled differently: if there are two newlines in the text, llama.cpp outputs a 2126 token but HF tokenizes with two 206s instead. @saurabhdash do you know which one is better to run inference with for your model: token 2126 (double newline) or two 206s? Or, in other words: if you had the string, |
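For anyone who wants to check the newline behaviour themselves, a hedged sketch of the comparison (the sample string is made up, not the one used in the tests above):

```python
# Compare how the HF tokenizer encodes a double newline; llama.cpp reportedly
# emits a single 2126 ('\n\n') token where HF emits two 206 ('\n') tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
ids = tok.encode("Hello\n\nWorld", add_special_tokens=False)
print(ids)
print([tok.decode([i]) for i in ids])
```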
Over time I've been collecting results from a shorter 400-task run on random models that I see hyped on Reddit, and Command-R scores around the same as most other big open-source models. Here are some other models:
(don't use this to rank models) A typical score for a big model is 80-something. I use it as a type of smoke test; I think at least once there was a highly upvoted, hyped model on the subreddit that was actually just completely broken when tested. If the hellaswag-400 test gives 40 or 50 or something as a result, you know it's probably broken. Empirically, Command-R kicks ass. |
One afterthought just as I wrote that: I only saw double newlines as a divergence, but maybe if you had a long context with emojis or lots of non-English characters (e.g. a long Chinese context), it could be a lot more divergent. I still feel llama.cpp probably has it correct, and the HF implementation is the one that has the bug. @saurabhdash when you see this, see if you can find an answer to the question in my previous comment. 😄 |
I checked: it is encoded as [34313, 206, 206, 17080]. PS: If there is a way to use the chat template, you'll see better results. |
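Regarding the PS about the chat template, a hedged example of building a prompt with the tokenizer's bundled template (the message content is made up):

```python
# Sketch: use the tokenizer's chat template instead of a raw prompt string.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
messages = [{"role": "user", "content": "Write a haiku about GGUF files."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the templated string can then be passed to llama.cpp as the prompt
```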
Then the implementation looks correct :) |
Thank you all for the implementation and the detailed analysis - much appreciated!
@acanis If I could, I'd tip you for getting this to work so quickly :) great speed, I'm impressed. @saurabhdash I still have questions about tokenization, but later today or next week I'll open that discussion on HF instead. Thanks everyone! 👏 👏 |
Thanks so much for the help everyone! I'm glad we were able to get this support merged with correct functionality. @Noeda great work on the detailed analysis between llama.cpp and the reference, and for fixing the quantization bug, that must have been hard to track down! I'm also curious about the tokenization; I will check out your HF discussion. This was a great opportunity for me to learn the llama.cpp codebase. |
We should definitely do better with the defaults. The problem is that not everybody has enough VRAM to offload the entire model, and currently we don't have a way to detect how many layers can be offloaded automatically. |
Thanks so much for the patch and for merging it so quickly after the release of Command-R. Not sure if this is the correct place to comment on a possible issue. I seem to have a problem converting the original model. I have tried clean builds of the latest master and the acanis repo and get the same error. I had a quick look at the convert-hf-to-gguf source and tried hacking around a bit, but encountered other issues. I'm reasonably sure this is repeatable, but I'm also struggling to understand why no one else is reporting it. Presumably I can work around the problem by checking out the converted models, but I thought I'd mention it in case it's an actual issue.
python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01 |
For those who are interested: I'm following up my tokenization question over here: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27 |
@chrismrutherford you need to update your HF repo to get the latest version of the config.json: The PR to add model_max_length with the true 128k context length was only merged yesterday. |
Thank you very much for your excellent work. |
Information about the Command-R 35B model (128k context) can be found at:
https://huggingface.co/CohereForAI/c4ai-command-r-v01
Based on the llama2 model with a few changes:
1) New hyperparameter to scale output logits (logit_scale)
2) Uses LayerNorm instead of RMSNorm
3) Transformer layers have a single shared LayerNorm that feeds into both the self-attention and FFN layers in parallel. There is no post-attention LayerNorm.
4) No support for Rotary Position Embeddings (RoPE) scaling
5) No biases used
Find GGUF files here:
https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF
To convert the model to GGUF format yourself:
1) Download the Command-R Hugging Face safetensors:
git lfs install
git clone https://huggingface.co/CohereForAI/c4ai-command-r-v01
2) Run:
python3 convert-hf-to-gguf.py --outtype f16 ./c4ai-command-r-v01
They now released a larger, 104B parameter model: C4AI Command R+ |
Some of the parameters of the new model are different. @saurabhdash I think you are the committer of that top commit. I feel like this is becoming my catchphrase... but I can help add the new layer code for Command R+. |
Lemme know if you need any help. It's just qk_norm and GQA instead of MHA. |
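For context, a hedged sketch of what "qk_norm plus GQA" typically looks like (illustrative only, not the actual Command R+ modeling code; the exact norm type and placement may differ):

```python
# Illustrative sketch: per-head LayerNorm on Q and K ("qk_norm") plus
# grouped-query attention, where fewer KV heads are repeated to match Q heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, n_kv_heads, seq, head_dim) -> (batch, n_kv_heads * n_rep, seq, head_dim)
    b, h, s, d = x.shape
    return x[:, :, None].expand(b, h, n_rep, s, d).reshape(b, h * n_rep, s, d)

class QKNormGQASketch(nn.Module):
    def __init__(self, n_heads: int, n_kv_heads: int, head_dim: int):
        super().__init__()
        self.q_norm = nn.LayerNorm(head_dim)
        self.k_norm = nn.LayerNorm(head_dim)
        self.n_rep = n_heads // n_kv_heads

    def forward(self, q, k, v):
        # q: (b, n_heads, s, d); k, v: (b, n_kv_heads, s, d)
        q, k = self.q_norm(q), self.k_norm(k)                       # qk_norm
        k, v = repeat_kv(k, self.n_rep), repeat_kv(v, self.n_rep)   # GQA
        return F.scaled_dot_product_attention(q, k, v)
```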
I love how active this project is :) |