
Can llama.cpp/convert.py support tokenizers other than 'spm', 'bpe', 'hfft'? #6690

Closed
woodx9 opened this issue Apr 15, 2024 · 13 comments

@woodx9
Contributor

woodx9 commented Apr 15, 2024

I am trying to convert deepseek-ai/deepseek-coder-1.3b-base using llama.cpp/convert.py with the following command:

python llama.cpp/convert.py codes-hf \
  --outfile codes-1b.gguf \
  --outtype q8_0

Output:

Loading model file codes-hf/pytorch_model.bin
params = Params(n_vocab=32256, n_embd=2048, n_layer=24, n_ctx=16384, n_ff=5504, n_head=16, n_head_kv=16, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=<RopeScalingType.LINEAR: 'linear'>, f_rope_freq_base=100000, f_rope_scale=4.0, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyQ8_0: 7>, path_model=PosixPath('codes-hf'))
Traceback (most recent call last):
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
  File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1407, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

the "tokenizer_class": "LlamaTokenizerFast", is there a way to support it?

@jukofyork
Contributor

jukofyork commented Apr 16, 2024

Try adding --vocab-type bpe as an option. IIRC, I had to do that for deepseek-coder models.
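For the command in the original post, that would look something like this (a sketch, not tested against this exact model):

python llama.cpp/convert.py codes-hf \
  --outfile codes-1b.gguf \
  --outtype q8_0 \
  --vocab-type bpe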

@woodx9
Contributor Author

woodx9 commented Apr 16, 2024

Try adding --vocab-type bpe as an option. IIRC, I had to do that for deepseek-coder models.

I tried that, but I don't think it's the right choice; they are two different tokenization methods, after all. I also have a question about how to handle the tokenizer when using --no-vocab.

@phymbert
Collaborator

DeepSeek model support is in progress:

@woodx9
Contributor Author

woodx9 commented Apr 16, 2024

DeepSeek model support is in progress:

I have read all of them, thank you! I will see what I can do!

@hyperbolic-c

hyperbolic-c commented Apr 17, 2024

Looking forward to your work!

@chuckpaulson

I am trying to convert https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct using llama.cpp/convert.py
I get the following error when trying to convert, similar to the deepseek-coder error mentioned above. I have not been able to fix it; can anyone help?

command:
python llama.cpp/convert.py llama3-8b --outfile llama3-8b-8k-f16.gguf --outtype f16

output:
Loading model file llama3-8b/model-00001-of-00004.safetensors
Loading model file llama3-8b/model-00001-of-00004.safetensors
Loading model file llama3-8b/model-00002-of-00004.safetensors
Loading model file llama3-8b/model-00003-of-00004.safetensors
Loading model file llama3-8b/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama3-8b'))
Traceback (most recent call last):
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
  File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1407, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

@professorf

Add --vocab-type bpe to the command line.

That should fix it.
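Applied to the command above, that would be (same flags, just with the vocab type added; not tested here):

python llama.cpp/convert.py llama3-8b --outfile llama3-8b-8k-f16.gguf --outtype f16 --vocab-type bpe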

@dagelf

dagelf commented Apr 21, 2024

This also applies when using convert.py to convert the Meta-distributed Llama 3 files.

@teleprint-me
Contributor

teleprint-me commented Apr 22, 2024

Oh, it's been a while, but I found it!

python convert.py local/models/deepseek-ai/deepseek-coder-6.7b-instruct --vocab-type hfft --pad-vocab

This is the original command I used. You need to use the --vocab-type and --pad-vocab options. I forget exactly why; it was related to PR #3633, and you can read the rationale for it there.

The Meta-distributed Llama 3 files are currently unsupported. I've been working on it all day today to see if I can figure it out.

22:47:15 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | θ) λ python convert.py /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct --vocab-type bpe
Loading model file /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('/mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1555, in <module>
    main()
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1522, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1424, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1414, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

I have no idea what format Meta used, and that's the part I'm stuck on right now. torchtext also seems to use binary formats rather than plaintext BPE formats, which is why I'm stuck at the moment.

22:55:32 | ~/Local/vocab-model
(.venv)  λ bpython
bpython version 0.24 on top of Python 3.11.8 /home/austin/Local/vocab-model/.venv/bin/python
>>> tokenizer_model_path = "/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B/tokenizer.model"
>>> tokenizer_model = open(tokenizer_model_path)
>>> vocab = [line.split() for line in tokenizer_model.readlines()]
>>> len(vocab)
128000
>>> vocab[0]
['IQ==', '0']
>>>  # This is kind of funny and apropos for how I'm feeling rn, lol

I have a couple ideas, but if anyone knows how to go about this, I'm all ears.
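Those entries look like a tiktoken-style BPE ranks file, where each line is a base64-encoded token followed by its rank. Assuming that format (an assumption, not something confirmed in this thread), the vocab can be decoded like this:

import base64

# Same file as in the REPL session above; substitute your own local path.
tokenizer_model_path = "/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B/tokenizer.model"

# Assumption: tiktoken-style ranks file, one "base64_token rank" pair per line.
ranks = {}
with open(tokenizer_model_path, encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)

print(len(ranks))                # 128000, matching the count above
print(base64.b64decode("IQ=="))  # b'!', i.e. the token with rank 0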

@woodx9
Contributor Author

woodx9 commented Apr 25, 2024

Oh, it's been a while, but I found it!

python convert.py local/models/deepseek-ai/deepseek-coder-6.7b-instruct --vocab-type hfft --pad-vocab

This is the original command I used. You need to use the --vocab-type and --pad-vocab options.

Does hfft actually fit the way the DeepSeek tokenizer works? I doubt it. Can you explain the reasoning, please?

@teleprint-me
Contributor

teleprint-me commented Apr 25, 2024

@woodx9 I didn't create it, so you'll need to read the linked rationale.

@github-actions github-actions bot added the stale label May 26, 2024
@kentonbmax

@github-actions github-actions bot removed the stale label May 29, 2024
@github-actions github-actions bot added the stale label Jun 28, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
