model: support arch DbrxForCausalLM
#6515
llama: dbrx: review graph
I think you meant |
I do not have enough VRAM to compute the imatrix on Please note that at the moment I have not reached the model loading step, as I am still converting the weights to GGUF f16, so the graph will probably need some adjustments. But your help is very welcome 👍 EDIT: updated the imatrix output filename, thanks |
@phymbert Great work! Also for IQ3_S I think the imatrix is required. And you're right, I'm not rich enough to imatrix grok-1 on f16. |
Thanks, we will see when it will generate something :) llama.cpp/examples/quantize/quantize.cpp Line 382 in 54ea069
|
I stand corrected, and I learn yet again :) FYI, I'm finishing up Command-R+ and I'll upload DBRX here. |
Trying to quantize I get a key not found error:
|
Can anybody see anything wrong with this
I'm not expecting it to match the output of @phymbert's exactly, but it's just absolute garbage compared to his for some reason (the It's not just slightly bad either - it's sub-broken-frankenmerge level of bad! It almost feels like some of its tensors are lost or scrambled because it completely failed all the simple coding tasks phymbert's It was created with:
and I definitely pulled the instruct repo:
I don't normally use @phymbert Can you confirm the |
I run only on CPU, so the output will differ from CUDA yes.
I did not manage to complete the upload, sorry. Better to restart from the original repo, yes. The quantized models I have uploaded were computed without an importance matrix. |
I think it was using CPU? I reran it with Interestingly @dranger003
It's harder for me to test this one on the coding tasks as Ollama won't import |
I meant the CPU ggml backend, not CUDA. |
Oh, no problem. |
@jukofyork This model is an instruct model, without its template I don't think the output will be reliable. Below is what I use and the output seems consistent and reliable.
|
Or use |
Yes, but that kicks you into interactive instruct mode. Unless there is an option to use the template without auto activating interactive? |
I do not remember facing such an issue. |
@abhi-mosaic any idea why the quantized models are performing so badly? Is there an issue with the llama.cpp quantization methods for this architecture? |
Can you try this? This gives me an input prompt in interactive mode.
|
Yeah, sorry for not making it clear: that was just a test to match the one here. I've been using the --chatml option and have the correct template in the Ollama modelfile too. The odd thing is that dranger003's I'm redownloading phymbert's I'll report back if I find out the problem. Is there any command to dump the metadata from the GGUF files so I can run diff on them? |
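One way to get a diff-able metadata dump is to read the key/value section of the file directly. The following is a minimal sketch, assuming a little-endian GGUF file of version 2 or later (64-bit counts) as laid out in the GGUF specification; the helper names and the `(header)` keys are made up for this illustration, and the demo bytes are synthetic rather than taken from any real model.

```python
import io
import struct

# GGUF metadata value type id -> struct format (little-endian).
SCALAR_FORMATS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
                  6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
TYPE_STRING, TYPE_ARRAY = 8, 9


def _unpack(stream, fmt):
    """Read one fixed-size value, failing loudly if the file ends early."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("file ended inside the metadata (truncated?)")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _unpack(stream, "<Q")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("file ended inside a string (truncated?)")
    return data.decode("utf-8", errors="replace")


def _read_value(stream, value_type):
    if value_type in SCALAR_FORMATS:
        return _unpack(stream, SCALAR_FORMATS[value_type])
    if value_type == TYPE_STRING:
        return _read_string(stream)
    if value_type == TYPE_ARRAY:
        item_type = _unpack(stream, "<I")
        count = _unpack(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_gguf_metadata(stream):
    """Return the header counts and all metadata key/value pairs as a dict."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    meta = {"(header) version": _unpack(stream, "<I"),
            "(header) tensor_count": _unpack(stream, "<Q")}
    for _ in range(_unpack(stream, "<Q")):  # number of key/value pairs
        key = _read_string(stream)
        meta[key] = _read_value(stream, _unpack(stream, "<I"))
    return meta


def metadata_lines(stream, max_array_items=8):
    """One 'key = value' line per entry, sorted so two dumps can be diffed."""
    lines = []
    for key, value in sorted(read_gguf_metadata(stream).items()):
        if isinstance(value, list) and len(value) > max_array_items:
            value = value[:max_array_items] + [f"... ({len(value)} items)"]
        lines.append(f"{key} = {value!r}")
    return lines


def _gguf_string(text):
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


# Synthetic two-entry file, for demonstration only.
demo = (b"GGUF" + struct.pack("<IQQ", 3, 0, 2)
        + _gguf_string("general.architecture") + struct.pack("<I", 8) + _gguf_string("dbrx")
        + _gguf_string("dbrx.expert_count") + struct.pack("<I", 4) + struct.pack("<I", 16))
print("\n".join(metadata_lines(io.BytesIO(demo))))
```

For a real file, pass `open(path, "rb")` instead of the `io.BytesIO` demo and write the lines of each model to a text file before running diff. Long arrays such as the vocabulary are shortened to their first few items so the output stays readable.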
|
I just pulled the latest master branch of llama.cpp and built it about an hour ago.
I also tried --chatml, but got the same result. |
The iq3_xxs has been quantized without imatrix, so don't expect too much from the generation quality. |
Yes, I merged bc I was hoping to use it with Ollama later. |
I was going to say it won't work in Ollama: but it seems iq3_xxs is the only IQ type supported (?)
|
Just to say I've now completely redone everything from scratch and have successfully quantized a
The I think the problem I had yesterday was the FP16 GGUF must have been corrupted or truncated somehow; I did notice Anyway, panic over and the Big thanks to @phymbert and @dranger003 for all the help! 👍 |
So I'm just converting
These may be known but just thought I should point them out in case relevant. |
If you set the convert script to use temp files it will keep the ram in check. This should probably be parameterized and also a default. |
Ollama 0.3.2 supports dbrx. I downloaded their version and tried. |
* model: dbrx convert to gguf ggerganov#6344
* llama: support dbrx ggerganov#6344
* doc: dbrx: add the model as supported
* scripts: get-wikitext-2 add unzip
* llama: increase maximum experts allowed
* llama: factorize moe graph implementation between grok, mixtral and dbrx
---------
Co-authored-by: Megha Agarwal <[email protected]>
This model is so weird... I tried to re-quant it today to use the new BPE stuff and found out it wouldn't work, so as I had already deleted the old model I went back to the old llama.cpp pull from mid-April (that I used above) to recreate it from scratch, and it's back to working like a broken frankenmerge again!? WTF??? I thought it might be because I used 22 threads instead of 12 in the example above (and a thread race is causing it, etc.), but I think it's the FP16 that must be screwing up somehow, as I noticed it seems to be getting a PPL score way bigger than any other model when creating the imatrix file:
This is using Last time it did this, I completely re-downloaded and redid everything, but TBH the model isn't that great and I'm not sure I can be arsed with all that again (plus it seems unlikely I would have corrupted the same downloaded files that worked before yet again!?). |
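For readers unfamiliar with the PPL score mentioned above: perplexity is the exponential of the mean negative log-likelihood per token, so a model that assigns low probability to the reference text (for example because its weights are damaged) scores much higher. A minimal sketch of the definition, with a hypothetical function name and made-up probabilities:

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up numbers: a model that gives every token probability 1/4
# has perplexity 4; halving the probabilities doubles the perplexity.
print(perplexity([math.log(0.25)] * 8))
print(perplexity([math.log(0.125)] * 8))
```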
So gonna have one more go at creating the FP16 (using branch |
So I think there may be a thread race if you set the number of quantize threads too high and I don't get the gibberish producing version if I keep to 12 threads (as suggested earlier in this PR). But it also seems this model doesn't play nice with the imatrix calculation either (compared to phymbert's original I'm now experimenting with turning on all 16 experts to create the imatrix file: imatrix created with 4 experts
imatrix created with 16 experts
and using that imatrix to quantize the original FP16 (i.e. with 4 experts) to see the effect... It could be that some of the experts aren't getting triggered at all, or are getting weighted so low in the gating that they contribute hardly anything. With regard to my comment on the potential use of Tikhonov regularization: this would be solved by setting the diagonals in proportion to sqrt(n) (where n is the number of times the expert was not skipped over) instead of using the identity matrix, to account for the sample size differences. Depending on how the value of |
Using all 16 experts does somewhat work, but looking through the code I think I can see what is the root of the problem:
By setting
I see 2 potential problems:
We are dividing I think it would be a good idea for somebody who knows the code base to have a really close look at what is going on here and possibly also double check that the softmax gate-weight factors are properly taken into account via backprop, etc. (I think) a simple (hacky) fix for (2) is:
to:
for the expert loop (assuming there isn't some double counting in the loops I can't see). @ggerganov I don't know if it's worth making a separate issue for this or if it is a problem introduced when |
My head hurts thinking about this but pretty sure (2) is correct via this example: If only 1 expert were selected the counts would look like this:
So for every 16 samples that went into the attention weighting factors, 1 sample went into each expert's weighting factors (on average). If all 16 experts were selected the counts would look like this:
So for every 1 sample that went into the attention weighting factors 1 sample went into each of the expert's weighting factors also. In both cases it looks like the expert's weighting factors are getting divided by 16 times more than they should be. I'm trying the What difference does it make to the code in
Considering that the quantization is per-tensor anyway? The |
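The sample-count argument in this comment can be checked with a few lines of arithmetic. This is only a sketch of the counting, not of the imatrix code itself: it assumes every token contributes one sample to the attention tensors and that routing is uniform on average, so each of the 16 experts sees `n_selected / 16` samples per token. The function name is made up for the illustration.

```python
from fractions import Fraction

N_EXPERTS = 16  # experts per FFN, as stated in the PR description


def samples_per_tensor(n_tokens, n_selected):
    """Return (attention samples, average samples seen by each expert)."""
    attention_samples = n_tokens  # every token passes through attention
    # Assumption: tokens are spread evenly over the experts on average.
    per_expert = Fraction(n_tokens * n_selected, N_EXPERTS)
    return attention_samples, per_expert


for n_selected in (1, 4, 16):
    attention, per_expert = samples_per_tensor(n_tokens=16, n_selected=n_selected)
    print(f"{n_selected:2d} selected: attention={attention}, per expert={per_expert}")
```

With 1 expert selected the ratio is 16 attention samples to 1 expert sample, and with all 16 selected it is 1 to 1, matching the two cases described above.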
On further thought maybe this doesn't matter as the On even further thought, is this actually intentional or accidental? If the old code created a diagonal Hessian approximation per expert and now it's all lumped into one huge tensor, the experts that don't get selected and/or have a lower softmax gate weight are going to have their importance downgraded? Is that what is really wanted or not? It looks like this is the correct PR for this: #6387 |
Support of arch DbrxForCausalLM
DBRX, provided by Databricks, is a mixture-of-experts model in which each FFN is divided into 16 experts, of which only 4 are activated at any given time.
Notable differences from Mixtral presented by @abhi-mosaic are:
The graph from modeling_dbrx.py is:
input>layers[Norm>Attention(qkv,clamp,rope)>Norm>MOE_ffn]>Norm>Output
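As an illustration of the MOE_ffn step, where 4 of the 16 experts are active per token, here is a minimal pure-Python sketch. It assumes a softmax over the router logits followed by top-4 selection and renormalisation of the kept weights; that ordering, and the names `route` and `moe_ffn`, are assumptions made for the example rather than a description of the actual llama.cpp graph.

```python
import math

N_EXPERTS = 16  # experts per FFN, from the PR description
N_ACTIVE = 4    # experts activated per token, from the PR description


def route(router_logits, k=N_ACTIVE):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    if len(router_logits) < k:
        raise ValueError("fewer experts than requested")
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: the kept weights are renormalised to sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_ffn(x, experts, router_logits):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for index, weight in route(router_logits):
        y = experts[index](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Toy experts: expert i simply scales its input by i.
toy_experts = [lambda x, s=i: [s * v for v in x] for i in range(N_EXPERTS)]
toy_logits = [float(i) for i in range(N_EXPERTS)]
print(route(toy_logits))
print(moe_ffn([1.0, 2.0], toy_experts, toy_logits))
```

Only the selected experts are evaluated for a given token, which is why the 12 unselected experts receive no activation samples for that token.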
Thanks to @slaren: this was pretty straightforward after the Grok-1 merged-experts example.
Special thanks to @megha95 for the review and fixes.
Closes #6344.
Changes
dbrx architecture in convert-hf-to-gguf.py, gguf-py and llama.cpp
dbrx graph implementation in llama.cpp
eval-callback also prints last n elements of each dimension
Tests (WIP, help welcomed)
0. Setup llama.cpp
2.b Debug the graph
./build/bin/eval-callback \
  --model models/dbrx-16x12b-instruct-f16.gguf \
  --prompt "hello world!" \
  --seed 42 \
  --chatml

./scripts/get-wikitext-2.sh
./build/bin/perplexity \
  --model models/dbrx-16x12b-instruct-q4_0.gguf \
  -ngl 41 \
  -f wikitext-2-raw/wiki.test.raw \
  -b 512
# Results TODO (help welcomed)

./scripts/get-hellaswag.sh
./build/bin/perplexity \
  --model models/dbrx-16x12b-instruct-q4_0.gguf \
  -ngl 41 \
  -f hellaswag_val_full.txt \
  --hellaswag \
  --hellaswag-tasks 400
# Results TODO (help welcomed)

./scripts/get-wikitext-2.sh
./build/bin/imatrix \
  --model models/dbrx-16x12b-instruct-f16.gguf \
  -f wikitext-2-raw/wiki.train.raw \
  -o dbrx-16x12b-instruct-f16.imatrix \
  --seed 42 \
  --chatml
# Results TODO (help welcomed)
For convenience, imatrix will be uploaded to https://huggingface.co/phymbert/dbrx-16x12b-instruct-f16/tree/main
8. Split and upload to HF (because we love it)
Examples
./build/bin/main \
  --model models/dbrx-16x12b-instruct-f16.gguf \
  --seed 42 \
  --prompt "I believe the meaning of life is"

I believe the meaning of life is to learn and grow as a person. To become a better person in every possible way. To learn from your mistakes. To learn from other people and their experiences. To learn from different cultures and ways of life. To learn from different religions and philosophies. To learn from the world around you and the universe you live in. To learn from every single thing you encounter and experience in your life. To learn from every single person you meet and every single person who crosses your path. To learn from every single thing you see, hear, touch, taste, and smell. To learn from every single emotion you feel. To learn from every single thought you have. To learn from every single dream you have. To learn from every single moment of your life. To learn from every single experience you have. To learn from every single day of your life. To learn from every single year of your life. To learn from every single decade of your life. To learn from every single lifetime you live. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. To learn from every single thing you are. To learn from every single thing you become. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. To learn from every single thing you are. To learn from every single thing you become. To learn from every single thing you do.
To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. To learn from every single thing you are. To learn from every single thing you become. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. [end of text]
Q8_0
Q6_K
Q4_0
Q3_K_M
IQ3_S
IQ3_XXS
All quantized models are uploaded to the phymbert/dbrx-16x12b-instruct-gguf collection
Usage in the server
Tasks
gpt2/bpe except with different vocab & merges: model: support arch DbrxForCausalLM #6515 (comment) REMOVED by Databricks thanks
chatml yes, EDIT REMOVED, standard now
License
DBRX is distributed under the Databricks Open Model License agreement.