Support Llama 3 conversion #6745
The tokenizer is BPE.
What a 🐐 |
What a champion lol. PR open within 30 minutes of model release. |
Doesn't seem that the |
I can't convert 70b on this. EDIT: run with "--vocab-type bpe". |
This is what we did to get the model out -- it doesn't seem like the special tokens are added properly. We are looking deeper for further improvements / fixes. Edit by JG: made collapsible
|
@USBhost did you try with |
The instruct models need the |
When I add https://huggingface.co/meta-llama/Meta-Llama-3-70B/blob/main/original/tokenizer.model I get the same error as on convert.py |
@m18coppola the instruct models use two different EOS tokens: the standard one ( I'm not sure how to replicate this behaviour yet. The best solution would be to use a list of eos/stop tokens, but I don't know how to do it, any suggestions on where to look? Another idea would be to use |
@pcuenca for the changes: "special": false on <|start_header_id|> <|end_header_id|> <|eot_id|> |
@jxy Our comments were sent at the same time :) Yes, that's one of the solutions I mentioned, but I'm not sure it will work consistently, I've seen models that use various terminators depending on context. We can try it out though, I'll take a look. |
Sorry lads I had to run with --vocab-type bpe |
From the model card on HF:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
Not sure if this is helpful or not 😅 but thought I might as well mention it. |
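A minimal sketch of the "list of stop tokens" idea discussed above, in plain Python with no model attached. The `next_token` callback, the scripted model, and the ids `EOS_ID` and `EOT_ID` are hypothetical stand-ins for illustration; this is not how llama.cpp or transformers implement it.

```python
from typing import Callable, Iterable, List, Set


def generate_until_stop(
    next_token: Callable[[List[int]], int],
    prompt_ids: Iterable[int],
    stop_ids: Set[int],
    max_new_tokens: int = 256,
) -> List[int]:
    """Collect new token ids until any id in stop_ids is produced."""
    context = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token in stop_ids:
            # Any terminator ends generation, not only the standard EOS.
            break
        generated.append(token)
        context.append(token)
    return generated


# Hypothetical ids, used only for this demo.
EOS_ID = 1
EOT_ID = 9


def make_scripted_model(script: List[int]) -> Callable[[List[int]], int]:
    """Return a fake 'model' that replays a fixed sequence of token ids."""
    it = iter(script)
    return lambda context: next(it)


demo_output = generate_until_stop(
    make_scripted_model([5, 6, 7, EOT_ID, 8]),
    prompt_ids=[0],
    stop_ids={EOS_ID, EOT_ID},
)
print(demo_output)
```

With a set of terminators the loop stops at the end-of-turn id even though the standard EOS id never appears, which is the behaviour the instruct models seem to need.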
It seems the model generates |
It's always the tokenizer. The tokenizers are always a mess. Special tokens apply to the instruct tuned model. The ChatFormat class in the source code shows how they implemented it. They're using lots of new special tokens.
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [
    f"<|reserved_special_token_{i}|>"
    for i in range(5, self.num_reserved_special_tokens - 5)
]
This should be interesting (and not in a fun way either). This is gonna create another level of complexity. |
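To make the list above concrete, here is a rough sketch of how the special tokens could map to ids. Two things are assumptions on my part: that the base BPE vocabulary has 128000 entries (`NUM_BASE_TOKENS`) and that special tokens are numbered consecutively right after it, with 256 reserved slots. Both are consistent with the `n_vocab=128256` reported in a conversion log later in this thread, but check the reference tokenizer before relying on them.

```python
NUM_BASE_TOKENS = 128000           # assumption: size of the base BPE vocabulary
NUM_RESERVED_SPECIAL_TOKENS = 256  # assumption: total number of special-token slots

special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [
    f"<|reserved_special_token_{i}|>"
    for i in range(5, NUM_RESERVED_SPECIAL_TOKENS - 5)
]

# Assumption: ids are assigned in list order, starting right after the base vocabulary.
special_token_ids = {
    token: NUM_BASE_TOKENS + index for index, token in enumerate(special_tokens)
}

total_vocab = NUM_BASE_TOKENS + len(special_tokens)
print(len(special_tokens), total_vocab)
print(special_token_ids["<|end_of_text|>"], special_token_ids["<|eot_id|>"])
```

Under these assumptions the two stop candidates, `<|end_of_text|>` and `<|eot_id|>`, end up eight ids apart, so a converter that records only a single EOS id can describe one of them but not both.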
Instead of remapping which creates more confusion, just update the generation code to stop on eot_id. At least from my cursory tests, all special texts are tokenized properly out of the box. I did a bit of testing and chat works. |
Okay, it's in there.
# BOS / EOS token IDs
self.bos_id: int = self.special_tokens["<|begin_of_text|>"]
self.eos_id: int = self.special_tokens["<|end_of_text|>"]
self.pad_id: int = -1
self.stop_tokens = {
    self.special_tokens["<|end_of_text|>"],
    self.special_tokens["<|eot_id|>"],
}
@pcuenca The list of stop tokens is usually added during inference. The chat templates have been embedded lately into
I think I get it now.
Completions:
Instructions:
That's how I'm interpreting it at the moment. Feel free to correct me. |
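For reference, a hedged sketch of how an instruct prompt could be assembled from the header and end-of-turn tokens. It follows my reading of the ChatFormat class mentioned earlier in the thread, but it works on strings rather than token ids, and the function names are my own; verify against the reference source before depending on the exact layout.

```python
from typing import Dict, List


def encode_header(role: str) -> str:
    """Header that opens a turn: role wrapped in header tokens, then a blank line."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def encode_message(role: str, content: str) -> str:
    """One complete turn, closed by the end-of-turn token."""
    return encode_header(role) + content.strip() + "<|eot_id|>"


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Concatenate all turns and leave an open assistant header for the reply."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += encode_message(message["role"], message["content"])
    # The model is expected to write its reply here and finish it with <|eot_id|>.
    text += encode_header("assistant")
    return text


prompt = build_prompt([{"role": "user", "content": "Hi"}])
print(prompt)
```

Since every turn ends with `<|eot_id|>`, a chat reply is terminated by that token, while plain completions would end on `<|end_of_text|>`; that seems to be the split between the two stop tokens.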
@teleprint-me Yep, you just have to stop on eot_id instead which is: You can use the tokenization tool to test: https://github.com/ggerganov/llama.cpp/blob/master/examples/tokenize/tokenize.cpp
|
This appears to work for chatting with the model (instruct):
|
In the original Now it's assuming the huggingface BPE format instead of BPE in a general implementation as it was originally. These changes continue to break the The current implementation for
17:46:36 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(llama3-conversion | θ) λ python convert.py --vocab-type bpe /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct
Loading model file /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('/mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1555, in <module>
main()
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1522, in main
vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1424, in load_vocab
vocab = self._create_vocab_by_path(vocab_types)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1414, in _create_vocab_by_path
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
The issue propagates from the
These issues are not related to this PR but are now affecting it.
BPE tokenizer implementations keep me up at night. Rant aside, the It should be noted that Llama 1 and Llama 2 used
17:51:49 | /mnt/valerie/models/meta-llama
λ file Meta-Llama-3-8B-Instruct/tokenizer.model
Meta-Llama-3-8B-Instruct/tokenizer.model: ASCII text
17:52:05 | /mnt/valerie/models/meta-llama
λ file /mnt/scsm/models/facebook/llama-2/llama-2-7b/tokenizer.model
/mnt/scsm/models/facebook/llama-2/llama-2-7b/tokenizer.model: data |
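The `file` output above (ASCII text for Llama 3, binary data for Llama 2) suggests a converter could sniff the tokenizer format before picking a loader. Below is a heuristic sketch, assuming the Llama 3 `tokenizer.model` uses a tiktoken-style layout where each line is a base64-encoded token followed by an integer rank. The function name and the tiny demo files are hypothetical; this is not what convert.py does.

```python
import base64
import binascii
import os
import tempfile


def looks_like_tiktoken_bpe(path: str, max_lines: int = 32) -> bool:
    """Heuristic: True if the first lines look like '<base64 token> <integer rank>'."""
    checked = 0
    with open(path, "rb") as f:
        for raw_line in f:
            if checked >= max_lines:
                break
            line = raw_line.strip()
            if not line:
                continue
            parts = line.split()
            if len(parts) != 2:
                return False
            token_b64, rank = parts
            try:
                base64.b64decode(token_b64, validate=True)
                int(rank)
            except (binascii.Error, ValueError):
                return False
            checked += 1
    return checked > 0


def _write_temp(data: bytes) -> str:
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Tiny made-up files: one text-style, one with arbitrary binary bytes.
text_path = _write_temp(b"IQ== 0\nIg== 1\nIw== 2\n")
binary_path = _write_temp(b"\n\x0e\n\x05<unk>\x15\x00\x00\x00\x00\x18\x02")
text_result = looks_like_tiktoken_bpe(text_path)
binary_result = looks_like_tiktoken_bpe(binary_path)
os.remove(text_path)
os.remove(binary_path)
print(text_result, binary_result)
```

A check like this only tells the two layouts apart; it says nothing about whether the merges or special tokens are handled correctly afterwards.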
cc @dranger003 - I really appreciated your ppl chart visual + measured ppl gap table for different quantization types for CommandR+. Do you think you would be willing to recreate those comparisons on L3 70b (base or Instruct, preferably base?) Thanks |
@teleprint-me Are you saying that it's a happy coincidence that the current llama.cpp implementation happens to tokenize correctly or there exists character sequences out there that will be tokenized incorrectly? |
Does anyone have convert instructions that work? I'm trying both Meta and HF models using this PR and none of the convert scripts work:
$ python3.11 convert.py ~/Data/llama3/Meta-Llama-3-8B/ --outfile ./models/llama-8b-v3/ggml-model-f16.gguf --outtype f16 --vocab-type bpe
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
$ python3.11 convert-hf-to-gguf.py ~/Data/huggingface/Meta-Llama-3-8B/ --outfile ./models/llama-8b-v3/ggml-model-f16.gguf --outtype f16
FileNotFoundError: File not found: /Users/ggerganov/Data/huggingface/Meta-Llama-3-8B/tokenizer.model
I see a few other people reporting the same problems. Those who succeeded - what were the necessary changes? |
@ggerganov Try |
@ddh0 The
$ python3.11 convert-hf-to-gguf.py ~/Data/huggingface/Meta-Llama-3-8B/ --outfile ./models/llama-8b-v3/ggml-model-f16.gguf --outtype f16 --vocab-type bpe
usage: convert-hf-to-gguf.py [-h] [--vocab-only] [--awq-path AWQ_PATH] [--outfile OUTFILE] [--outtype {f32,f16}] [--bigendian] [--use-temp-file] model
convert-hf-to-gguf.py: error: unrecognized arguments: --vocab-type bpe |
@ggerganov These are working on my end.
|
Hi @abasu0713, I followed your walkthrough and cloned the latest hf model and
|
@XiongjieDai are you using the right file name? Because the quantized model it wrote ends only with |
Oh man, sorry about the typo. No, it is still not working.
|
@abasu0713 @XiongjieDai is using the wrong script.
Should use HF script instead.
Use the HF model too, not the one distributed from meta. https://huggingface.co/meta-llama/Meta-Llama-3-8B It will work afterwards. |
Sorry for bothering you guys. It was just a missing slash in the path... Thank you for your prompt reply! |
Trust me, you're not alone 😅. I don't know how many times I've been stymied by a '/' or a '\'. |
The documentation really needs to be better. How many resources are being wasted because the documentation is telling people to use convert.py? I know it just cost me about a day. Why isn't there just one interface anyway? Very confusing. I've been making my own quants for months now and I still don't know which one to use when. There should be only one interface and the documentation should be up-to-date and accurate. Crazy ideas, I know |
Actually I've been wondering, what's the purpose of the convert.py script? If the hf one does everything needed, should convert.py be removed? |
Well, according to the main readme, that's the only one you should even know about and use. I feel like I'm taking crazy pills |
Thanks, sorry for the late response, just saw this one. |
The purpose of the convert.py script is to partially load tensors dynamically as the conversion occurs. This reduces memory usage during the conversion process. Normally, the entire model's weights are loaded, which is very RAM intensive. So that's why it exists. It should be easy to understand why this is valuable to have. |
Is this true..? I need the full amount of RAM to load models when using convert-hf-to-gguf, multiple hundred GB for the biggest ones |
It depends on the model. Load a raw torch model (like a 7B one) and watch as it begins to consume about 40 GB of RAM. That's a lot of RAM. Just to load the model! I think clarifying the script's name would probably help with the confusion. Perhaps convert-torch.py would be more appropriate. |
I load raw 7B models with only like 20gb of VRAM :S are you loading in FP32? |
Well, the other option would be bfloat, or half, right? Quants weren't as popular and as widely available when I originally tested it. This was about 2 years ago... wow, time flies. 💀 |
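A back-of-the-envelope check on the numbers in this exchange: weight memory is roughly parameter count times bytes per parameter. The sketch below assumes exactly 7e9 parameters and ignores any overhead (temporary copies during loading, framework buffers), so it is a lower bound, not a measurement.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Approximate memory for the raw weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("fp32", "fp16", "bf16"):
    print(f"7B weights in {dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

By this estimate the weights alone need roughly twice as much memory in FP32 as in a 16-bit type, which would fit the gap between the two figures quoted above if one load was FP32 and the other half precision. Anything beyond the raw weights would have to come from loading overhead, which this sketch does not model.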
A step in the right direction. Why not merge the scripts into a unified convert.py? |
@kalomaze Here it is, this is using 400 chunks imatrix on wiki.train for quants below Q6_K.
|
Did anyone find a solution to this?
|
@mapleroyal You should use
|
Even though I'm using the original meta (i.e. non-hf) model? |
By original you mean the |
Yes, exactly. Ok, got it. Thank you. |