convert : refactor vocab selection logic #6355

Merged Mar 28, 2024 (13 commits)

cebtenzzre
Collaborator

This PR clears up confusion about the purpose of HfVocab by making it explicit that it is only for LLaMA "SPM" vocabularies stored in tokenizer.json format, not for generic HuggingFace fast tokenizer (tokenizer.json) vocabs. (The one exception is its use for WordPiece, which will be corrected in a follow-up PR.)

PR #5821 fixed some of the confusion as to which files map to which tokenizers, but in adding the automatic fallback to HfVocab it unintentionally caused a few issues.

This PR makes it the job of each vocab class to attempt to load the vocab from the appropriate files, and to fail if tokenizer.json represents the wrong vocab type.
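
To illustrate the idea of a vocab class that validates its own input, here is a minimal sketch. It is not the code from this PR: the class name `SketchBpeVocab` is hypothetical, and it assumes that tokenizer.json carries a top-level "model" object with a "type" field (as HuggingFace fast tokenizer files do) and that FileNotFoundError is the signal for "this is not my kind of tokenizer".

```python
import json
from pathlib import Path


class SketchBpeVocab:
    """Hypothetical vocab class that only accepts a BPE tokenizer.json."""

    name = "bpe"

    def __init__(self, base_path: Path):
        fname = base_path / "tokenizer.json"
        if not fname.is_file():
            raise FileNotFoundError(f"{fname} not found")

        data = json.loads(fname.read_text(encoding="utf-8"))
        model = data.get("model") or {}

        # Refuse a tokenizer.json that describes some other vocab type,
        # instead of silently loading it with the wrong loader.
        if model.get("type") != "BPE":
            raise FileNotFoundError(f"{fname} does not describe a BPE vocab")

        self.vocab = model["vocab"]  # token string -> token id
        self.vocab_size = len(self.vocab)
        self.fname_tokenizer = fname
```

Raising the same exception type for "file missing" and "wrong vocab type" keeps the caller simple, since it can treat both cases as "try the next vocab class".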

I also changed the Vocab Union to a pair of Protocols to make the API a little more explicit.
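
As a rough illustration of what "a pair of Protocols" can look like compared with a Union, here is a sketch using `typing.Protocol`. All names (`BaseVocabSketch`, `VocabSketch`, `NoVocabSketch`, `ToyVocab`, `describe`) and the chosen members are hypothetical and are not taken from convert.py.

```python
from __future__ import annotations

from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class BaseVocabSketch(Protocol):
    """Minimal interface: anything that can name itself."""
    name: str


@runtime_checkable
class VocabSketch(BaseVocabSketch, Protocol):
    """Fuller interface: a vocab that actually carries tokens."""
    vocab_size: int

    def all_tokens(self) -> Iterable[tuple[bytes, float]]: ...


class NoVocabSketch:
    """Satisfies only the base protocol (no token data)."""
    name = "no_vocab"


class ToyVocab:
    """Satisfies the full protocol without inheriting from it."""
    name = "toy"

    def __init__(self) -> None:
        self.tokens = [b"<s>", b"</s>", b"hello"]
        self.vocab_size = len(self.tokens)

    def all_tokens(self) -> Iterable[tuple[bytes, float]]:
        for tok in self.tokens:
            yield tok, 0.0


def describe(vocab: BaseVocabSketch) -> str:
    # Structural check: does the object expose the full vocab interface?
    if isinstance(vocab, VocabSketch):
        return f"{vocab.name}: {vocab.vocab_size} tokens"
    return f"{vocab.name}: no token data"
```

With a Union, each consumer has to know the concrete classes; with Protocols, the required attributes and methods are spelled out once and any class that provides them qualifies.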


With these changes, converting e.g. deepseek-llm-7b-chat results in this exception with the default --vocab-type:

FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

And converting with --vocab-type bpe --pad-vocab works as expected.
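
The behavior described above (try each requested vocab type in order, fail with a single error if none loads) can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: `select_vocab` and the two stub classes are hypothetical, and the stubs stand in for real vocab classes that would inspect files under `base_path`.

```python
from pathlib import Path


class UnavailableStub:
    """Hypothetical vocab class whose tokenizer files are never found."""
    name = "spm"

    def __init__(self, base_path: Path):
        raise FileNotFoundError(f"no {self.name} tokenizer under {base_path}")


class AvailableStub:
    """Hypothetical vocab class that always loads successfully."""
    name = "bpe"

    def __init__(self, base_path: Path):
        self.base_path = base_path


def select_vocab(base_path: Path, vocab_types: list[str], classes: list[type]):
    """Return the first vocab in vocab_types that loads from base_path."""
    by_name = {cls.name: cls for cls in classes}

    # Reject unknown type names up front, before touching any files.
    for vtype in vocab_types:
        if vtype not in by_name:
            raise ValueError(f"Unsupported vocabulary type {vtype}")

    for vtype in vocab_types:
        try:
            return by_name[vtype](base_path)
        except FileNotFoundError:
            continue  # this tokenizer is unavailable; try the next one

    raise FileNotFoundError(
        f"Could not find a tokenizer matching any of {vocab_types}"
    )
```

In this sketch, `select_vocab(Path("."), ["spm"], [UnavailableStub, AvailableStub])` raises FileNotFoundError, while passing `["spm", "bpe"]` falls through to `AvailableStub`.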

With #5821, the model would appear to convert successfully with the default --vocab-type but fail at runtime, and --vocab-type bpe did not recognize the model.

Prior to #5821, the presence of tokenizer.json caused convert.py to attempt to load it as a sentencepiece model:

RuntimeError: Internal: could not parse ModelProto from /home/jared/dirs/text-ai-models/dl/deepseek-llm-7b-chat/tokenizer.json

Closes #6245
Fixes #6238
Fixes #6216
Fixes #5973

@cebtenzzre cebtenzzre requested a review from ggerganov March 27, 2024 22:20
@cebtenzzre cebtenzzre merged commit be55134 into master Mar 28, 2024
54 of 60 checks passed
@cebtenzzre cebtenzzre deleted the ceb/fix-convert-bpe-hf branch March 28, 2024 15:44
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024