
supporting more diverse tokenizers #2420

Merged (2 commits) on Jul 28, 2023

Conversation

eric8607242
Contributor

Hi, thanks for this awesome work.

I am currently trying to perform inference on different LLMs (e.g., xgen and Aquila) using this project.

I kept running into issues when generating Chinese text.
Using the --verbose-prompt flag, I found that Chinese words are consistently tokenized into the wrong token IDs.
After digging into the root cause, I found that Chinese characters, which are composed of multiple bytes, are tokenized incorrectly by this line:

llama_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;

This code works for the llama family of models primarily because llama's tokenizer lists the byte tokens in character-code order, with three special tokens placed at the beginning:

'<unk>': 0,
'<s>': 1,
'</s>': 2,
'<0x00>': 3,
'<0x01>': 4,
'<0x02>': 5,
'<0x03>': 6,
...

Unfortunately, not all open-source pre-trained models adopt llama's tokenizer; the xgen and Aquila models mentioned above are two examples.
Therefore, to support more diverse pre-trained tokenizers, I believe we should use the vocabulary generated by convert.py in this case instead of the hard-coded offset.

For example, the xgen's tokenizer map looks like:

b'!': 0,
b'"': 1,
b'#': 2,
b'$': 3,
b'%': 4,
b'&': 5,
b"'": 6,
b'(': 7,
...

Although this PR modifies only one line of code, it brings significant benefits for supporting more models with UTF-8 characters. As noted in #2228, enabling BPE alone in convert.py is not sufficient to infer Chinese text successfully without this modification.
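To make the idea concrete, here is a rough sketch of a vocabulary-driven byte lookup (an illustration only, not the exact diff of this PR; the names vocab.token_to_id, symbol.text, symbol.n, and output follow llama.cpp's tokenizer at the time, and the fallback branch is my assumption):

// Sketch: resolve each raw byte through the vocabulary produced by convert.py
// instead of assuming that byte tokens always start at id 3 (llama layout).
for (size_t j = 0; j < symbol.n; ++j) {
    const std::string byte_str(symbol.text + j, 1);           // one raw byte as a string
    const auto it = vocab.token_to_id.find(byte_str);         // vocabulary from convert.py
    llama_vocab::id token_id;
    if (it != vocab.token_to_id.end()) {
        token_id = it->second;                                 // e.g. xgen maps b'!' to 0
    } else {
        token_id = static_cast<uint8_t>(symbol.text[j]) + 3;   // llama-style <0xNN> fallback
    }
    output.push_back(token_id);
}

With a vocabulary like xgen's above, the lookup resolves the byte directly, while a llama vocabulary (whose byte tokens are spelled <0xNN>) still takes the fallback path.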

Big thanks again for this amazing work!

@clyang
Contributor

clyang commented Jul 27, 2023

Having the same issue and I can confirm that this PR fixes my problem!

@ggerganov added the "help wanted" (Extra attention is needed) label on Jul 27, 2023
@ggerganov
Member

Will need some help here as I don't feel confident about the inner workings of the tokenizer.
Would this change only have positive effects, or are there any potential regressions that could result from it?

@klosax
Contributor

klosax commented Jul 27, 2023

I don't know if this has any negative effects, but it looks like this is a temporary solution.

Neither of the mentioned models is trained using the llama tokenizer: Xgen uses tiktoken and Aquila uses the gpt2 tokenizer. Support for different tokenizers will be easy to add in GGUF. The most common tokenizer is gpt2, and it will be supported from the start since it is already implemented in the ggml examples.

@ggerganov merged commit ee1b497 into ggml-org:master on Jul 28, 2023
@goerch
Collaborator

goerch commented Aug 6, 2023

This change seems to conflict with the proposed fixes for the LLaMa tokenizer.

@ggerganov
Member

Should we revert it, or can we adapt the PR to this change?

@goerch
Collaborator

goerch commented Aug 6, 2023

I'm trying to test Aquila-7B.

@klosax
Contributor

klosax commented Aug 6, 2023

This looks like a temporary solution. In GGUF we have support for a real gpt2 tokenizer since it supports adding the merges. In PR #2398 there is a gptneox example with an excellent gpt2 tokenizer supporting both merges and unicode.
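For readers unfamiliar with gpt2-style BPE, here is a minimal sketch of what "adding the merges" means (my own illustration, not code from #2398 or the GGUF work): the tokenizer keeps a ranked table of merge pairs and greedily merges the adjacent pair with the lowest rank until no listed pair remains.

#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Greedy BPE merge step: 'pieces' is a word split into initial symbols and
// 'ranks' maps a mergeable pair to its priority (lower = merge earlier).
static std::vector<std::string> bpe_apply_merges(
        std::vector<std::string> pieces,
        const std::map<std::pair<std::string, std::string>, int> & ranks) {
    while (pieces.size() > 1) {
        int    best_rank = INT_MAX;
        size_t best_i    = 0;
        for (size_t i = 0; i + 1 < pieces.size(); ++i) {
            const auto it = ranks.find({pieces[i], pieces[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i    = i;
            }
        }
        if (best_rank == INT_MAX) {
            break;                                  // no applicable merge left
        }
        pieces[best_i] += pieces[best_i + 1];       // merge the best-ranked pair
        pieces.erase(pieces.begin() + best_i + 1);
    }
    return pieces;
}

In addition to the merges, the gpt2 tokenizer maps raw bytes to printable unicode characters before merging, which is what lets multi-byte UTF-8 text round-trip without a llama-style byte-fallback offset.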

@goerch
Collaborator

goerch commented Aug 6, 2023

> Should we revert it, or can we adapt the PR to this change?

After working through some issues with #2487, I have adapted the PR to the best of my knowledge. test-tokenizer-1 is somehow working for Aquila-7B, but I certainly need help here, especially with character classification. An adaptation of test-tokenizer-0 seems most desirable, too.

> This looks like a temporary solution. In GGUF we have support for a real gpt2 tokenizer since it supports adding the merges. In PR #2398 there is a gptneox example with an excellent gpt2 tokenizer supporting both merges and unicode.

Should we continue to test with the main branch or merge the PR into GGUF now?
