convert.py: When --vocab-only is passed, generate false but valid params #7027
Conversation
…ams to allow vocab creation solely from tokenizer.model. An example of how this might be used in the style of baby-llama will be attached with this PR.
You can extract the vocab and build the model from the extracted vocab already. Is this PR supposed to create a "dummy" vocab from an existing one? If so, why not create a script that gives more fine-grained control to specify the desired params? I think this idea has potential; I'm just not sure if this is the way to go about it. Just my two cents, so take what I'm saying with a grain of salt.
No, this is the opposite. I highly recommend looking at the example ZIP to see what it's doing; in short, it does not come with any model or model params.
Is there an uncompressed version? I don't download random zip files. Better yet, instructions on how to reproduce the results would be welcome. I can pull your branch in and test that way after reviewing the changes more in depth.
(...repeat ad infinitum)
The scripts are the instructions, because these require a lot of fiddly command-line options, and manually typing them every time is very annoying.
Yeah, I'm sold. I think this is an excellent idea. I think we could probably create a custom script for this. Something to note, as an aside, is that I think this is a great short-term solution, but a more robust long-term solution might be to properly implement a custom tokenizer. This is a great start, though.
The problem with this approach is that it doesn't support all tokenizers that can be imported, which may be detrimental. SPM was used because it was the simplest to configure, but any tokenizer should be usable.
There's definitely merit to this, but it would interfere when someone wants to use, say, sentencepiece training for their custom model.
Would it make more sense to modify the training examples to accept the vocab-only models without hparams?
If that's doable, then it would, but then that's potentially a format break anyway...
I see what you mean. That wasn't my intention; I was excited in the moment when I thought about the potential. I think loading any vocabulary would be ideal, which is the point of the conversion script. I was thinking that it would be nice to be able to specify the hyperparameters from the CLI.
The tokenizers are baked into the converted GGUFs so they can be used for inference after being trained/finetuned. So maybe I misunderstood? My understanding is that the vocab is extracted from the source vocabulary and then converted to a GGUF-compatible format. This allows us to train and finetune with the extracted vocab. The idea with the conversion script is that we can take a custom sentencepiece tokenizer (or any other tokenizer, vocab, etc.) and convert it to a proper GGUF to use for training and finetuning. I just thought working with SPM models first, to experiment with, would be easier. In any case, the GGUF format bakes the tokenizer into the model, which is convenient. It's a detail I really appreciate.
train-text-from-scratch ignores most hyperparameters from the vocab model (notably, it does not ignore …).
We do need them for inference. Regardless, I agree with you. I think we're on the same page. As you suggested, it's probably best handled in another PR.
While working on an unrelated PR, it looks like there is a much simpler solution, as @slaren suggested earlier: JoanFM@b7ede48#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR3803-R3808

```diff
@@ -3800,6 +3800,12 @@ static void llm_load_hparams(
     // get hparams kv
     ml.get_key(LLM_KV_VOCAB_SIZE, hparams.n_vocab, false) || ml.get_arr_n(LLM_KV_TOKENIZER_LIST, hparams.n_vocab);

+    // everything past this point is not vocab-related
+    if (hparams.vocab_only) {
+        return;
+    }
+
     ml.get_key(LLM_KV_CONTEXT_LENGTH,      hparams.n_ctx_train);
     ml.get_key(LLM_KV_EMBEDDING_LENGTH,    hparams.n_embd);
     ml.get_key(LLM_KV_FEED_FORWARD_LENGTH, hparams.n_ff);
```

This change would allow loading vocab-only models without any changes to the convert script. @20kdc If you could give this a try and confirm that it works, it might be better to revert the changes from this PR.
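For context, a vocab-only GGUF is consumed through the C API by setting `vocab_only` in the model params; the early return above would simply let `llm_load_hparams` skip the keys such a file does not carry. Below is a minimal sketch of how one might exercise that path. The function names (`llama_load_model_from_file`, `llama_model_default_params`, `llama_n_vocab`) are taken from the llama.h API as it stood around this time and should be treated as assumptions if reading this later:

```cpp
// Sketch: load a vocab-only GGUF and print its vocab size.
// Assumes the llama.h C API of this era; adjust names if they have since moved.
#include <cstdio>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <vocab-only.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // load only the tokenizer/vocab, no tensors

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        llama_backend_free();
        return 1;
    }

    printf("n_vocab = %d\n", llama_n_vocab(model));

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```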
Most of the changes in this PR are necessary for the convert script to function at all when no hyperparameters are given.
@ggerganov Is that used in …?
Vocab files presently include their source model's hyperparameter information. 'Faking it' allows vocab and model creation solely from tokenizer.model or similar.
An example of how this might be used in the style of baby-llama (this should be considered under the same MIT license as the rest of this PR):
example.zip
Particular applications of these custom vocabs may be non-language, non-safety-critical uses of LLMs where the versatility of the architecture is useful (I was thinking virtual pets, personally), or small models being trained to work with languages where dedicating more tokenization effort to the language may help boost performance (for much the same reasons that tokens are used in the first place).
Note that if `--pad-vocab` is given, then this would alter the vocab based on the real params, so the params must be loaded in this case.
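For reference, the intended workflow would look roughly like the following. The flag names are assumptions based on the convert.py options of this era (`--vocab-only`, `--outfile`) and may differ in other trees:

```sh
# Build a vocab-only GGUF from a directory containing only tokenizer.model
# (no config.json or weights). --pad-vocab is deliberately omitted here,
# since padding would need the real params, as noted above.
python convert.py path/to/tokenizer-dir --vocab-only --outfile vocab.gguf
```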