convert.py: When --vocab-only is passed, generate false but valid params #7027
Conversation
…ams to allow vocab creation solely from tokenizer.model. An example of how this might be used in the style of baby-llama will be attached with this PR.
You can extract the vocab and build the model from the extracted vocab already. Is this PR supposed to create a "dummy" vocab from an existing one? If so, why not create a script that gives more fine-grained control to specify the desired params? I think this idea has potential; I'm just not sure if this is the way to go about it. Just my two cents, so take what I'm saying with a grain of salt.
No, this is the opposite. I highly recommend looking at the example ZIP to see what it's doing; in short, it does not come with any model or model params.
Is there an uncompressed version? I don't download random zip files. Better yet, instructions on how to reproduce the results would be welcome. I can pull your branch in and test that way after reviewing the changes more in depth.
(...repeat ad infinitum)
The scripts are the instructions, because these require a lot of fiddly command-line options, and manually typing them every time is very annoying.
Yeah, I'm sold. I think this is an excellent idea. I think we could probably create a custom script for this. Something to note, as an aside, is that I think this is a great short-term solution, but a more robust long-term solution might be to properly implement a custom tokenizer. This is a great start, though.
The problem with this approach is that it doesn't support all tokenizers that can be imported, which may be detrimental. SPM was used because it was the simplest to configure, but any tokenizer should be usable.
There's definitely merit to this, but it would interfere when someone wants to use, say, sentencepiece training for their custom model.
Would it make more sense to modify the training examples to accept the vocab-only models without hparams?
If that's doable, then it would, but then that's potentially a format break anyway...
I see what you mean. That wasn't my intention; I was excited in the moment when I thought about the potential. I think loading any vocabulary would be ideal, which is the point of the conversion script. I was thinking that it would be nice to be able to specify the hyperparameters from the CLI.
The tokenizers are baked into the converted GGUFs so they can be used for inference after being trained/finetuned. So maybe I misunderstood? My understanding is that the vocab is extracted from the source vocabulary and then converted to a GGUF-compatible format. This allows us to train and finetune with the extracted vocab. The idea with the conversion script is that we can take a custom sentencepiece tokenizer (or any other tokenizer, vocab, etc.) and convert it to a proper GGUF to use for training and finetuning. I just thought working with SPM models first, to experiment with, would be easier. In any case, the GGUF format bakes the tokenizer into the model, which is convenient. It's a detail I really appreciate.
train-text-from-scratch ignores most hyperparameters from the vocab model (notably, it does not ignore …).
We do need them for inference. Regardless, I agree with you. I think we're on the same page. As you suggested, it's probably best handled in another PR.
While working on an unrelated PR, it looks like there is a much simpler solution, as @slaren suggested earlier: JoanFM@b7ede48#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR3803-R3808

```diff
@@ -3800,6 +3800,12 @@ static void llm_load_hparams(
     // get hparams kv
     ml.get_key(LLM_KV_VOCAB_SIZE, hparams.n_vocab, false) || ml.get_arr_n(LLM_KV_TOKENIZER_LIST, hparams.n_vocab);

+    // everything past this point is not vocab-related
+    if (hparams.vocab_only) {
+        return;
+    }
+
     ml.get_key(LLM_KV_CONTEXT_LENGTH,      hparams.n_ctx_train);
     ml.get_key(LLM_KV_EMBEDDING_LENGTH,    hparams.n_embd);
     ml.get_key(LLM_KV_FEED_FORWARD_LENGTH, hparams.n_ff);
```

This change would allow loading vocab-only models without any changes to the convert script. @20kdc If you could give this a try and confirm that it works, it might be better to revert the changes from this PR.
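For context, a vocab-only GGUF is consumed through the C API by setting `vocab_only` in the model params; the early return above would simply let `llm_load_hparams` skip the keys such a file does not carry. Below is a minimal sketch of how one might exercise that path. The function names (`llama_load_model_from_file`, `llama_model_default_params`, `llama_n_vocab`) are taken from the llama.h API as it stood around this time and should be treated as assumptions if reading this later:

```cpp
// Sketch: load a vocab-only GGUF and print its vocab size.
// Assumes the llama.h C API of this era; adjust names if they have since moved.
#include <cstdio>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <vocab-only.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // load only the tokenizer/vocab, no tensors

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        llama_backend_free();
        return 1;
    }

    printf("n_vocab = %d\n", llama_n_vocab(model));

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```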
Most of the changes in this PR are necessary for the convert script to function at all when no hyperparameters are given.
@ggerganov Is that used in …?
Vocab files presently include their source model's hyperparameter information. 'Faking it' allows vocab and model creation solely from tokenizer.model or similar.
An example of how this might be used in the style of baby-llama (this should be considered under the same MIT license as the rest of this PR):
example.zip
Particular applications of these custom vocabs may be non-language, non-safety-critical uses of LLMs where the versatility of the architecture is useful (I was thinking virtual pets, personally), or small models being trained to work with languages where dedicating more tokenization effort to the language may help boost performance (for much the same reasons that tokens are used in the first place).
Note that if `--pad-vocab` is given, then this would alter the vocab based on the real params, so the params must be loaded in this case.
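For reference, the intended workflow would look roughly like the following. The flag names are assumptions based on the convert.py options of this era (`--vocab-only`, `--outfile`) and may differ in other trees:

```sh
# Build a vocab-only GGUF from a directory containing only tokenizer.model
# (no config.json or weights). --pad-vocab is deliberately omitted here,
# since padding would need the real params, as noted above.
python convert.py path/to/tokenizer-dir --vocab-only --outfile vocab.gguf
```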