OpenLLaMA 3B support #1588
Conversation
Not working, perplexity around 2000.
I can confirm the following change helps to solve that.

Update: tested with n_mult = 216.

ne should be 3200 * 8640, the same as hidden_size * intermediate_size.
Are you sure? How did you calculate this number? When forcing it to load with

To patch the model file without going through the reconversion, you can use xxd like this to change the value:

# c8 = 200, 6c = 108
echo "0010: c8" | xxd -r - /models/open_llama_3b_preview_600bt_f16.bin
Sorry, I did n_mult = 216 and the convert.py line 610 change, and then it was all done.
It seems 216 and 108 have the same effect on calculating
Yes, this actually matters 👍

My initial thinking was that the number should be close to 200, unless there's a reason for a sudden drop 😹

With 256

Totally no clue about that; maybe you are right, even though that doesn't make much sense...
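To illustrate why 216 and 108 end up equivalent while 256 does not, here is a small sketch of the kind of rounding llama.cpp applies when deriving the feed-forward size from n_embd and n_mult (illustrative only; the constants are the 3B values from this thread):

```python
# Sketch of how the feed-forward size falls out of n_embd and n_mult:
# 8/3 * n_embd, rounded up to a multiple of n_mult (integer arithmetic).
def n_ff(n_embd: int, n_mult: int) -> int:
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

for mult in (256, 216, 108):
    print(mult, n_ff(3200, mult))
# 256 -> 8704 (wrong size), 216 -> 8640, 108 -> 8640 (matches intermediate_size)
```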
Ideally it should be read from the json file and saved in the model as |
Co-authored-by: FNsi <[email protected]>
I agree with that; another file format change might be needed.
Ugh, I'd rather not. Maybe the field could be used for backward compat, so that if it is 256 then it means |
Yep, 20B, 40B, 50B, or 120B LLaMA models might come in the future...
I think it's all OK

Yes

Perplexity done, in the description ↑

I uploaded a working quantized version to HuggingFace
and checksums as well:
Confirmed to have the same checksum as mine. @Sovenok-Hacker, do you have the F16 version of the 350bt checkpoint available somewhere?
Currently it seems to be working. The only problem I see right now is that convert.py does not load the correct parameters from the JSON file, but I don't really know enough about it to know how to change it.
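For illustration only (not the actual convert.py code), one direction would be to read the hyperparameters straight from the checkpoint's config.json instead of guessing them from tensor shapes. The field names below assume the usual HF LLaMA config layout; the commented values are the 3B numbers discussed in this thread.

```python
# Sketch: derive Params-like values from the HF config.json shipped with the
# checkpoint, instead of inferring them from tensor shapes.
import json
from pathlib import Path

def params_from_config(model_dir: str) -> dict:
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return {
        "n_vocab": cfg["vocab_size"],            # 32000 for open_llama_3b
        "n_embd":  cfg["hidden_size"],           # 3200
        "n_head":  cfg["num_attention_heads"],   # 32
        "n_layer": cfg["num_hidden_layers"],     # 26
        # intermediate_size (8640) is what n_mult is really standing in for;
        # storing it directly would avoid the rounding trick entirely.
        "n_ff":    cfg["intermediate_size"],
    }
```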
Somehow I like this 3B: it's easy to train, and it seems like another way to achieve unlimited context length...

It seems like they trained it more on popular culture; it seems to understand Star Trek characters better than even LLaMa 30B: https://mdon.ee/@slyecho/110437699369998728

As the team labelled it, it uses RedPajama-Data-1T, the same as MPT-7B, and then I found MPT has a 1B version...

Appreciate your great work. Do you have all those ggml .bin files available on HF or somewhere else?
I put a link in the description with the files. |
The Would be cool if we have at least the

I can remove convert.py for now and create an issue to update it. In the worst case we can just provide the model files for users right now.
Can I just use the changes view?

Somehow I think it's complex or will lead to using transformers 😅 Is there a simple way to replace n_mult?
Yes
Quickly ran some tests (edit: I let them predict for >10 min each):
- q4_0
- q5_1
- f16

so I think this can be merged as is :)
@@ -58,6 +59,7 @@ static const size_t MB = 1024*1024;
 static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
 {
     static std::map<e_model, size_t> k_sizes = {
+        { MODEL_3B,    128ull * MB },
I was suspicious here, since it's so much less. But it ran without any issue for me, so I guess the others might be too large. :)
I tested before merging with --no-mmap, the f16 model, --memory-f32, -c 2048, -n 4096, which should be the worst case, but it worked.
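As a rough sanity check of why that combination is the heaviest, here is a back-of-the-envelope sketch, assuming the usual KV cache layout of 2 × n_layer × n_ctx × n_embd values:

```python
# Back-of-the-envelope KV cache size for the 3B model (n_layer=26, n_embd=3200).
# Assumes the cache stores one K and one V vector per layer per context position.
n_layer, n_ctx, n_embd = 26, 2048, 3200
kv_f32 = 2 * n_layer * n_ctx * n_embd * 4   # bytes with --memory-f32
kv_f16 = kv_f32 // 2                        # bytes with the default f16 cache
print(f"KV cache: {kv_f32 / 2**20:.0f} MiB (f32) vs {kv_f16 / 2**20:.0f} MiB (f16)")
# ~1300 MiB vs ~650 MiB
```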
Just for reference, the diff for the converter script:

--- a/convert.py	2023-05-30 20:48:07.687486627 +0300
+++ b/convert.py	2023-05-30 20:47:55.854142065 +0300
@@ -143,12 +143,22 @@
     def guessed(model: 'LazyModel', file_type: GGMLFileType) -> 'Params':
         n_vocab, n_embd = model["tok_embeddings.weight"].shape
+        n_mult=256
+        n_head=n_embd // 128
+        n_layer=next(i for i in itertools.count() if f"layers.{i}.attention.wq.weight" not in model)
+
+        # TODO: hack for open_llama_3b
+        if n_embd == 3200:
+            n_mult = 216
+            n_head = 32
+            n_layer = 26
+
         return Params(
             n_vocab=n_vocab,
             n_embd=n_embd,
-            n_mult=256,
-            n_head=n_embd // 128,
-            n_layer=next(i for i in itertools.count() if f"layers.{i}.attention.wq.weight" not in model),
+            n_mult=n_mult,
+            n_head=n_head,
+            n_layer=n_layer,
             file_type=file_type,
         )
@@ -597,7 +607,9 @@
     out["norm.weight"] = model["model.norm.weight"]
     out["output.weight"] = model["lm_head.weight"]
-    n_head = model["model.layers.0.self_attn.q_proj.weight"].shape[1] // 128
+    # TODO: hack for open_llama_3b
+    n_embd = model["model.layers.0.self_attn.q_proj.weight"].shape[1]
+    n_head = 32 if n_embd == 3200 else n_embd // 128
     for i in itertools.count():
         if f"model.layers.{i}.self_attn.q_proj.weight" not in model:
             break
Hi @SlyEcho just noticed that the scratch buffers for 3B are a bit too small to use batch size 512. Suggest increasing them from 128MB to 256MB |
Alright, created new PR. |
Not working, perplexity around 2000. The code doesn't crash and the model can be loaded.
ggml files: huggingface.co/SlyEcho/open_llama_3b_ggml
Source data here: huggingface.co/openlm-research/open_llama_3b_600bt_preview
More info: #1291
Perplexity on wiki.test.raw with -b 512 -c 512