supporting more diverse tokenizers #2420
Conversation
I'm having the same issue, and I can confirm that this PR fixes my problem!
I'll need some help here, as I don't feel confident about the inner workings of the tokenizer.
I don't know if this has any negative effects, but it looks like a temporary solution. Neither of the mentioned models was trained with the llama tokenizer: Xgen uses tiktoken and Aquila uses the gpt2 tokenizer. Support for different tokenizers will be easy to add in gguf. The most common tokenizer is gpt2, and it will be supported from the start since it is already implemented in the ggml examples.
This change seems to conflict with the proposed fixes for the LLaMa tokenizer.
Should we revert it, or can we adapt the PR to this change?
I'm trying to test Aquila-7B.
This looks like a temporary solution. In GGUF we have support for a real gpt2 tokenizer, since it supports adding the merges. In PR #2398 there is a gptneox example with an excellent gpt2 tokenizer supporting both merges and unicode.
After working through some issues with #2487, I have adapted the PR to the best of my knowledge.
Should we continue to test with the
Hi, thanks for this awesome work.
I am currently trying to perform inference with different LLMs (e.g., xgen and Aquila) using this project, and I keep running into issues when generating Chinese text.
By adopting the --verbose-prompt flag, I found that Chinese words are always being tokenized into the wrong token IDs. After digging into the root cause, I found that Chinese characters, which are composed of multiple bytes, are always tokenized incorrectly by this part.
This code works for the llama series of models primarily because llama's tokenizer follows the character coding order and places three special tokens at the beginning:
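To illustrate that assumption (a rough sketch, not the project's exact code): the standard LLaMA SentencePiece vocabulary starts with <unk>, <s>, </s> followed by the 256 byte tokens <0x00>..<0xFF> in ascending order, so a fallback byte can be mapped to an id with a fixed offset.

```cpp
#include <cstdint>

// Sketch of the byte-fallback assumption: ids 0..2 are <unk>, <s>, </s>,
// and ids 3..258 are the byte tokens <0x00>..<0xFF> in ascending byte order.
static int llama_style_byte_fallback_id(uint8_t byte) {
    return static_cast<int>(byte) + 3;
}

// Example: the Chinese character "中" is the UTF-8 byte sequence E4 B8 AD,
// which this scheme maps to ids 231, 187, 176 -- correct only for
// vocabularies that actually follow the layout above.
```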
Unfortunately, not all open-source pre-trained models adopt llama's tokenizer, for example the xgen and Aquila models mentioned above.
Therefore, to support a more diverse range of pre-trained model tokenizers, I believe we should use the vocabulary generated by convert.py appropriately in this case. For example, xgen's tokenizer map looks like:
Although this PR only modifies one line of code, it brings significant benefits for supporting more models that use multi-byte UTF-8 characters. As in #2228, enabling only BPE in convert.py is not sufficient to infer Chinese words correctly without this modification.
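For illustration, here is a minimal sketch of the idea (hypothetical names, not the actual one-line diff): resolve fallback bytes through the vocabulary produced by convert.py instead of assuming a fixed offset, so models whose byte tokens live at different ids still tokenize correctly.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical sketch: look the fallback byte up in the loaded vocabulary
// (`token_to_id` stands in for the map built from the convert.py output)
// rather than computing `byte + 3`, so tokenizers whose byte tokens sit at
// different ids (xgen, Aquila, ...) still resolve correctly.
static int byte_fallback_id(const std::unordered_map<std::string, int> & token_to_id,
                            uint8_t byte) {
    const auto it = token_to_id.find(std::string(1, static_cast<char>(byte)));
    return it != token_to_id.end() ? it->second : 0; // fall back to <unk> at id 0
}
```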
Big thanks again for this amazing work!