
Can llama.cpp/convert.py support tokenizers other than 'spm', 'bpe', 'hfft'? #6690

Closed
woodx9 opened this issue Apr 15, 2024 · 13 comments

@woodx9
Contributor

woodx9 commented Apr 15, 2024

I am trying to convert deepseek-ai/deepseek-coder-1.3b-base using llama.cpp/convert.py with the following command:

python llama.cpp/convert.py codes-hf \
  --outfile codes-1b.gguf \
  --outtype q8_0

Output:

Loading model file codes-hf/pytorch_model.bin
params = Params(n_vocab=32256, n_embd=2048, n_layer=24, n_ctx=16384, n_ff=5504, n_head=16, n_head_kv=16, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=<RopeScalingType.LINEAR: 'linear'>, f_rope_freq_base=100000, f_rope_scale=4.0, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyQ8_0: 7>, path_model=PosixPath('codes-hf'))
Traceback (most recent call last):
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1407, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

the "tokenizer_class": "LlamaTokenizerFast", is there a way to support it?

@jukofyork
Contributor

jukofyork commented Apr 16, 2024

Try adding --vocab-type bpe as an option. IIRC, I had to do that for deepseek-coder models.
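For the command in the original post, that would look something like this (a sketch, not tested against this exact model):

python llama.cpp/convert.py codes-hf \
  --outfile codes-1b.gguf \
  --outtype q8_0 \
  --vocab-type bpe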

@woodx9
Contributor Author

woodx9 commented Apr 16, 2024

Try adding --vocab-type bpe as an option. IIRC, I had to do that for deepseek-coder models.

I tried that, but I don't think it's the right choice; they are two different tokenization methods, after all. I also have a question about how to handle the tokenizer when using --no-vocab.

@phymbert
Collaborator

DeepSeek model support is in progress:

@woodx9
Contributor Author

woodx9 commented Apr 16, 2024

DeepSeek model support is in progress:

I have read all of them, thank you! I will see what I can do!

@hyperbolic-c

hyperbolic-c commented Apr 17, 2024

Looking forward to your work!

@chuckpaulson

I am trying to convert https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct using llama.cpp/convert.py
I get the following error when trying to convert, similar to the deepseek-coder error mentioned above. I have not been able to fix it; can anyone help?

command:
python llama.cpp/convert.py llama3-8b --outfile llama3-8b-8k-f16.gguf --outtype f16

output:
Loading model file llama3-8b/model-00001-of-00004.safetensors
Loading model file llama3-8b/model-00001-of-00004.safetensors
Loading model file llama3-8b/model-00002-of-00004.safetensors
Loading model file llama3-8b/model-00003-of-00004.safetensors
Loading model file llama3-8b/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama3-8b'))
Traceback (most recent call last):
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1407, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

@professorf

Add --vocab-type bpe to the command line.

That should fix it.
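Applied to the command above, that would be (same flags, just with the vocab type added; not tested here):

python llama.cpp/convert.py llama3-8b --outfile llama3-8b-8k-f16.gguf --outtype f16 --vocab-type bpe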

@dagelf

dagelf commented Apr 21, 2024

This also applies when using convert.py to convert the Meta-distributed Llama 3 files.

@teleprint-me
Contributor

teleprint-me commented Apr 22, 2024

Oh, it's been a while, but I found it!

python convert.py local/models/deepseek-ai/deepseek-coder-6.7b-instruct --vocab-type hfft --pad-vocab

This is the original command I used. You need to use the --vocab-type and --pad-vocab options. I forget exactly why; it was related to PR #3633, and you can read the rationale for it there.

The Meta-distributed Llama 3 files are currently unsupported. I've been working on it all day today to see if I can figure it out.

22:47:15 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | θ) λ python convert.py /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct --vocab-type bpe
Loading model file /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('/mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1555, in <module>
    main()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1522, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1424, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1414, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

I have no idea what format Meta used, and that's the part I'm stuck on right now. torchtext also seems to use binary formats rather than plaintext BPE formats, which is why I'm stuck at the moment.

22:55:32 | ~/Local/vocab-model
(.venv)  λ bpython
bpython version 0.24 on top of Python 3.11.8 /home/austin/Local/vocab-model/.venv/bin/python
>>> tokenizer_model_path = "/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B/tokenizer.model"
>>> tokenizer_model = open(tokenizer_model_path)
>>> vocab = [line.split() for line in tokenizer_model.readlines()]
>>> len(vocab)
128000
>>> vocab[0]
['IQ==', '0']
>>>  # This is kind of funny and apropos for how I'm feeling rn, lol

I have a couple ideas, but if anyone knows how to go about this, I'm all ears.
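Those entries look like a tiktoken-style BPE ranks file, where each line is a base64-encoded token followed by its rank. Assuming that format (an assumption, not something confirmed in this thread), the vocab can be decoded like this:

import base64

# Same file as in the REPL session above; substitute your own local path.
tokenizer_model_path = "/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B/tokenizer.model"

# Assumption: tiktoken-style ranks file, one "base64_token rank" pair per line.
ranks = {}
with open(tokenizer_model_path, encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)

print(len(ranks))                # 128000, matching the count above
print(base64.b64decode("IQ=="))  # b'!', i.e. the token with rank 0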

@woodx9
Contributor Author

woodx9 commented Apr 25, 2024

Oh, it's been a while, but I found it!

python convert.py local/models/deepseek-ai/deepseek-coder-6.7b-instruct --vocab-type hfft --pad-vocab

This is the original command I used. You need to use the --vocab-type and --pad-vocab options.

Does hfft actually fit the way the DeepSeek tokenizer works? I doubt it. Can you explain the reasoning, please?

@teleprint-me
Contributor

teleprint-me commented Apr 25, 2024

@woodx9 I didn't create it, so you'll need to read the linked rationale.

@github-actions github-actions bot added the stale label May 26, 2024
@kentonbmax

@github-actions github-actions bot removed the stale label May 29, 2024
@github-actions github-actions bot added the stale label Jun 28, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
