OpenLLaMA 3B support #1588

Merged: 3 commits into master, May 30, 2023

Conversation

SlyEcho (Collaborator) commented May 24, 2023

Initially not working, perplexity around 2000.

The code doesn't crash and the model can be loaded.

ggml files: huggingface.co/SlyEcho/open_llama_3b_ggml

Source data here: huggingface.co/openlm-research/open_llama_3b_600bt_preview

More info: #1291


Perplexity on wiki.test.raw with -b 512 -c 512

Q chunk perplexity
F16 [616] 8.4656
Q8_0 [616] 8.4667
Q5_1 [616] 8.5072
Q5_0 [616] 8.5156
Q4_1 [616] 8.6102
Q4_0 [616] 8.6674

SlyEcho added the "help wanted" label May 24, 2023
FNsi (Contributor) commented May 25, 2023

I can confirm the following changes help to solve that:

1. n_mult = 200
2. char buf[256] to char buf[200]
(in your file, around line 299)

Update:
ne still doesn't change with step 2. 😂

I only got it running because I deleted the lt.ne != ne check, and it gives me reasonable responses of no more than 3 sentences.

No clue yet, but I'd like to find out how ne ends up calculated as 3200 x 8600; there must be some integer rounding involved.

Tested with n_mult = 216, plus changing the // 128 in the n_head calculation to // 100, and it all works well.

FNsi (Contributor) commented May 25, 2023

ne should be 3200 * 8640, the same as hidden_size * intermediate_size.
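To spell that out, here is an illustrative sketch (names are illustrative, not llama.cpp's actual variables) of the mismatch being described: the loader derives a feed-forward width of 8600 from the wrong n_mult, while the checkpoint's tensors are 3200 x 8640, so the shape check fails.

```python
n_embd = 3200             # hidden_size of OpenLLaMA 3B
intermediate_size = 8640  # the checkpoint's real feed-forward width

expected = (n_embd, 8600)             # what the loader derived from a wrong n_mult
stored = (n_embd, intermediate_size)  # what the file actually contains: 3200 x 8640

# roughly the lt.ne != ne comparison that was deleted above to force the load
print(stored != expected)  # True, so the loader rejects the tensor
```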

SlyEcho (Collaborator, Author) commented May 25, 2023

> char buf[256] to char buf[200]

I don't think it needs to be a smaller buffer. (Although now I see two redundant strlen() calls there, but that is inconsequential in the grand scheme of things.)

> n_mult = 200

Are you sure? How did you calculate this number?

When I force it to load with n_ff = 8600, as with your changes, the output seems to be even more garbage.

To patch the model file without going through the reconversion, you can use xxd like this to change n_mult:

# c8 = 200, 6c = 108
echo "0010: c8" | xxd -r - /models/open_llama_3b_preview_600bt_f16.bin

FNsi (Contributor) commented May 25, 2023

> n_mult = 200
>
> Are you sure? How did you calculate this number?

Sorry, I did n_mult = 216.

And in convert.py, line 610,
shape[1] // 128
should change to shape[1] // 100.

Then it all works.

(screenshot)

SlyEcho (Collaborator, Author) commented May 25, 2023

> Sorry, I did n_mult = 216.

It seems 216 and 108 have the same effect on the n_ff calculation.

> And in convert.py, line 610, shape[1] // 128 should change to shape[1] // 100.

Yes, this actually matters 👍
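For illustration, a minimal sketch (not part of the patch) of why that divisor matters: convert.py guesses the head count from the embedding width assuming a head dimension of 128, which is wrong for this model.

```python
n_embd = 3200        # OpenLLaMA 3B hidden size
real_n_head = 32     # the model's actual head count (head dimension 100)

print(n_embd // 128)  # 25 -> convert.py's default guess, wrong for this model
print(n_embd // 100)  # 32 -> matches the real head count
```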

FNsi (Contributor) commented May 25, 2023

> It seems 216 and 108 have the same effect on the n_ff calculation.

My initial thinking was that the number should be close to 200, unless there's a reason for a sudden drop 😹

SlyEcho (Collaborator, Author) commented May 25, 2023

> My initial thinking was that the number should be close to 200, unless there's a reason for a sudden drop 😹

With n_mult = 256, n_ff comes out as 8600; I guess they wanted 8640, and 256 - 40 = 216. Maybe that is more logical, then.
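For context, a minimal sketch of the rounding llama.cpp uses to derive n_ff from n_embd and n_mult (my reading of the loader at the time of this PR, so treat the exact expression as an assumption); it shows why 216 and 108 have the same effect here:

```python
def guessed_n_ff(n_embd: int, n_mult: int) -> int:
    # round 2/3 of 4*n_embd up to the next multiple of n_mult
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

n_embd = 3200  # OpenLLaMA 3B hidden size
print(guessed_n_ff(n_embd, 216))  # 8640, the model's real intermediate size
print(guessed_n_ff(n_embd, 108))  # 8640 as well, hence the same effect
```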

FNsi (Contributor) commented May 25, 2023

> With n_mult = 256, n_ff comes out as 8600; I guess they wanted 8640, and 256 - 40 = 216. Maybe that is more logical, then.

No real clue about that; maybe you are right, even if it doesn't make much more sense...

SlyEcho (Collaborator, Author) commented May 25, 2023

> No real clue about that; maybe you are right, even if it doesn't make much more sense...

Ideally it should be read from the JSON file and saved in the model as n_ff; n_mult is otherwise not needed at all.

SlyEcho removed the "help wanted" label May 25, 2023
SlyEcho marked this pull request as ready for review May 25, 2023 09:16
FNsi (Contributor) commented May 25, 2023

> Ideally it should be read from the JSON file and saved in the model as n_ff; n_mult is otherwise not needed at all.

I agree with that; it might mean another file format change is needed.

SlyEcho (Collaborator, Author) commented May 25, 2023

> I agree with that; it might mean another file format change is needed.

Ugh, I'd rather not. Maybe the field could be used for backward compat: if it is 256 it means n_mult = 256, otherwise it is n_ff.
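A minimal sketch of that backward-compat idea on the loading side (hypothetical, just the idea being floated here, not what the PR implements):

```python
def resolve_n_ff(header_field: int, n_embd: int) -> int:
    # Hypothetical: reuse the existing n_mult slot in the file header.
    if header_field == 256:
        # old files: the field is n_mult, so derive n_ff the old way
        return ((2 * (4 * n_embd) // 3 + header_field - 1) // header_field) * header_field
    # new files: the field stores n_ff directly
    return header_field
```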

FNsi (Contributor) commented May 25, 2023

Yep, maybe 20B, 40B, 50B, or 120B LLaMA models will come in the future...

Sovenok-Hacker left a review:

I think it's all OK.

Sovenok-Hacker commented

> Yep, maybe 20B, 40B, 50B, or 120B LLaMA models will come in the future...

Yes

SlyEcho (Collaborator, Author) commented May 25, 2023

Perplexity done, in the description ↑

Sovenok-Hacker commented

I uploaded a working quantized version to HuggingFace

SlyEcho (Collaborator, Author) commented May 25, 2023

and checksums as well:

0103204cb367a4ae78a6dcc107ee95a0f0f216e6d276082a534e0dc337dd7452  open_llama_3b_preview_600bt-q5_1.bin
878a64232542f174ecd41ca76f18b959cdf41944fb878b5cf6cb89ab264bd59b  open_llama_3b_preview_600bt-q4_0.bin
6e3b1e60f3135395bd32d8bb10388051c24b79bc5c0b5bc5e9cab11ebea253c3  open_llama_3b_preview_600bt-q4_1.bin
7ed15048e392ce43abae56668f8df6cb0f7f1d48e4c8e924a9fc58a82510e6ac  open_llama_3b_preview_600bt-q5_0.bin
d4d4f2425f355dd57cae7c6766bbd99cf482c8b374cbf775c230f1a8c038c617  open_llama_3b_preview_600bt-q8_0.bin
4461ccd289eed0190045fa79447262401fe432b63e6d9a7919637c420814e90b  open_llama_3b_preview_600bt-f16.bin

SlyEcho (Collaborator, Author) commented May 25, 2023

> I uploaded a working quantized version to HuggingFace

Confirmed to have the same checksum as mine.

@Sovenok-Hacker, do you have the F16 version of the 350bt checkpoint available somewhere?

SlyEcho changed the title from "[wip] OpenLLaMA 3B support" to "OpenLLaMA 3B support" May 26, 2023
SlyEcho (Collaborator, Author) commented May 26, 2023

Currently it seems to be working. The only problem I see right now is that convert.py does not load the correct parameters from the JSON file, but I don't really know enough about it to know how to change that.

FNsi (Contributor) commented May 27, 2023

Somehow I like this 3B: it's easy to train, and it seems like another way to achieve unlimited context length...

SlyEcho (Collaborator, Author) commented May 27, 2023

It seems like they trained it more on popular culture; it seems to understand Star Trek characters better than even LLaMA 30B: https://mdon.ee/@slyecho/110437699369998728

FNsi (Contributor) commented May 27, 2023

As the team has labelled it, it uses RedPajama-Data-1T, the same as MPT-7B, and then I found that MPT has a 1B version...

xingchensong (Contributor) commented

> and checksums as well: [...]

Appreciate your great work! Do you have all those ggml .bin files available on HF or somewhere else?

SlyEcho (Collaborator, Author) commented May 30, 2023

I put a link in the description with the files.

Green-Sky (Collaborator) commented

The llama.cpp file changes look very mergeable, but in convert.py we should load the value from the config.json file.

It would be cool to have at least the llama.cpp changes merged before the finished 3B OpenLLaMA drops ("end of last week").

SlyEcho (Collaborator, Author) commented May 30, 2023

> The llama.cpp file changes look very mergeable, but in convert.py we should load the value from the config.json file.

I can remove the convert.py changes for now and create an issue to update it.

In the worst case we can just provide the model files for users right now.

Can I just use the changes view?
SlyEcho requested a review from Green-Sky May 30, 2023 13:44
FNsi (Contributor) commented May 30, 2023

> But in convert.py we should load the value from the config.json file.

Somehow I think that is complex, or it will end up depending on transformers 😅 Is there a simple way to replace n_mult?

Sovenok-Hacker commented

> @Sovenok-Hacker, do you have the F16 version of the 350bt checkpoint available somewhere?

Yes

Green-Sky (Collaborator) left a review:

Quickly ran some tests (edit: I let them predict for >10 min each):

  • q4_0
  • q5_1
  • f16

so I think this can be merged as is :)

@@ -58,6 +59,7 @@ static const size_t MB = 1024*1024;
static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
{
static std::map<e_model, size_t> k_sizes = {
+    { MODEL_3B, 128ull * MB },
Review comment (Collaborator):

I was suspicious here, since it's so much less, but it ran without any issue for me, so I guess the others might be too large. :)

SlyEcho (Collaborator, Author) replied:

I tested before merging with the f16 model, --no-mmap, --memory-f32, -c 2048 -n 4096, which should be the worst case, but it worked.

SlyEcho merged commit ffb06a3 into master May 30, 2023
SlyEcho (Collaborator, Author) commented May 30, 2023

Just for reference, the diff for the converter script:

--- a/convert.py        2023-05-30 20:48:07.687486627 +0300
+++ b/convert.py        2023-05-30 20:47:55.854142065 +0300
@@ -143,12 +143,22 @@
     def guessed(model: 'LazyModel', file_type: GGMLFileType) -> 'Params':
         n_vocab, n_embd = model["tok_embeddings.weight"].shape

+        n_mult=256
+        n_head=n_embd // 128
+        n_layer=next(i for i in itertools.count() if f"layers.{i}.attention.wq.weight" not in model)
+
+        # TODO: hack for open_llama_3b
+        if n_embd == 3200:
+            n_mult = 216
+            n_head = 32
+            n_layer = 26
+
         return Params(
             n_vocab=n_vocab,
             n_embd=n_embd,
-            n_mult=256,
-            n_head=n_embd // 128,
-            n_layer=next(i for i in itertools.count() if f"layers.{i}.attention.wq.weight" not in model),
+            n_mult=n_mult,
+            n_head=n_head,
+            n_layer=n_layer,
             file_type=file_type,
         )

@@ -597,7 +607,9 @@
     out["norm.weight"] = model["model.norm.weight"]
     out["output.weight"] = model["lm_head.weight"]

-    n_head = model["model.layers.0.self_attn.q_proj.weight"].shape[1] // 128
+    # TODO: hack for open_llama_3b
+    n_embd = model["model.layers.0.self_attn.q_proj.weight"].shape[1]
+    n_head = 32 if n_embd == 3200 else n_embd // 128
     for i in itertools.count():
         if f"model.layers.{i}.self_attn.q_proj.weight" not in model:
             break

Green-Sky added the "enhancement" and "model" labels May 30, 2023
ggerganov deleted the open_llama_3b branch May 30, 2023 20:22
LostRuins (Collaborator) commented

Hi @SlyEcho, I just noticed that the scratch buffers for 3B are a bit too small to use batch size 512. I suggest increasing them from 128 MB to 256 MB.

SlyEcho (Collaborator, Author) commented Jun 5, 2023

Alright, created a new PR.
