
supporting more diverse tokenizers #2420

Merged (2 commits) on Jul 28, 2023

Conversation

eric8607242
Contributor

Hi, thanks for this awesome work.

I am currently trying to perform inference on different LLMs (e.g., xgen and Aquila) using this project.

I kept running into issues when generating Chinese text.
Using the --verbose-prompt flag, I found that Chinese words are consistently tokenized into the wrong token IDs.
After digging into the root cause, I found that Chinese characters, which are composed of multiple bytes, are tokenized incorrectly by this line:

llama_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;

This code works for the llama family of models primarily because llama's tokenizer lists the byte tokens in character-code order, with three special tokens placed at the beginning:

'<unk>': 0,
'<s>': 1,
'</s>': 2,
'<0x00>': 3,
'<0x01>': 4,
'<0x02>': 5,
'<0x03>': 6,
...

Unfortunately, not all open-source pre-trained models adopt llama's tokenizer; the xgen and Aquila models mentioned above are two examples.
Therefore, to support more diverse pre-trained tokenizers, I believe we should use the vocabulary generated by convert.py in this case instead of the hard-coded offset.

For example, the xgen's tokenizer map looks like:

b'!': 0,
b'"': 1,
b'#': 2,
b'$': 3,
b'%': 4,
b'&': 5,
b"'": 6,
b'(': 7,
...

Although this PR modifies only one line of code, it brings significant benefits for supporting more models with UTF-8 characters. As noted in #2228, enabling BPE alone in convert.py is not sufficient to infer Chinese text successfully without this modification.
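To make the idea concrete, here is a rough sketch of a vocabulary-driven byte lookup (an illustration only, not the exact diff of this PR; the names vocab.token_to_id, symbol.text, symbol.n, and output follow llama.cpp's tokenizer at the time, and the fallback branch is my assumption):

// Sketch: resolve each raw byte through the vocabulary produced by convert.py
// instead of assuming that byte tokens always start at id 3 (llama layout).
for (size_t j = 0; j < symbol.n; ++j) {
    const std::string byte_str(symbol.text + j, 1);           // one raw byte as a string
    const auto it = vocab.token_to_id.find(byte_str);         // vocabulary from convert.py
    llama_vocab::id token_id;
    if (it != vocab.token_to_id.end()) {
        token_id = it->second;                                 // e.g. xgen maps b'!' to 0
    } else {
        token_id = static_cast<uint8_t>(symbol.text[j]) + 3;   // llama-style <0xNN> fallback
    }
    output.push_back(token_id);
}

With a vocabulary like xgen's above, the lookup resolves the byte directly, while a llama vocabulary (whose byte tokens are spelled <0xNN>) still takes the fallback path.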

Big thanks again for this amazing work!

@clyang
Contributor

clyang commented Jul 27, 2023

Having the same issue and I can confirm that this PR fixes my problem!

@ggerganov added the "help wanted" (Extra attention is needed) label on Jul 27, 2023
@ggerganov
Member

Will need some help here as I don't feel confident about the inner workings of the tokenizer.
Would this change only have positive effects, or are there any potential regressions that could result from it?

@klosax
Contributor

klosax commented Jul 27, 2023

I don't know if this has any negative effects, but it looks like this is a temporary solution.

Neither of the mentioned models is trained using the llama tokenizer: Xgen uses tiktoken and Aquila uses the gpt2 tokenizer. Support for different tokenizers will be easy to add in GGUF. The most common tokenizer is gpt2, and it will be supported from the start since it is already implemented in the ggml examples.

@ggerganov merged commit ee1b497 into ggml-org:master on Jul 28, 2023
@goerch
Collaborator

goerch commented Aug 6, 2023

This change seems to conflict with the proposed fixes for the LLaMa tokenizer.

@ggerganov
Member

Should we revert it, or can we adapt the PR to this change?

@goerch
Collaborator

goerch commented Aug 6, 2023

I'm trying to test Aquila-7B.

@klosax
Contributor

klosax commented Aug 6, 2023

This looks like a temporary solution. In GGUF we have support for a real gpt2 tokenizer since it supports adding the merges. In PR #2398 there is a gptneox example with an excellent gpt2 tokenizer supporting both merges and unicode.
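For readers unfamiliar with gpt2-style BPE, here is a minimal sketch of what "adding the merges" means (my own illustration, not code from #2398 or the GGUF work): the tokenizer keeps a ranked table of merge pairs and greedily merges the adjacent pair with the lowest rank until no listed pair remains.

#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Greedy BPE merge step: 'pieces' is a word split into initial symbols and
// 'ranks' maps a mergeable pair to its priority (lower = merge earlier).
static std::vector<std::string> bpe_apply_merges(
        std::vector<std::string> pieces,
        const std::map<std::pair<std::string, std::string>, int> & ranks) {
    while (pieces.size() > 1) {
        int    best_rank = INT_MAX;
        size_t best_i    = 0;
        for (size_t i = 0; i + 1 < pieces.size(); ++i) {
            const auto it = ranks.find({pieces[i], pieces[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i    = i;
            }
        }
        if (best_rank == INT_MAX) {
            break;                                  // no applicable merge left
        }
        pieces[best_i] += pieces[best_i + 1];       // merge the best-ranked pair
        pieces.erase(pieces.begin() + best_i + 1);
    }
    return pieces;
}

In addition to the merges, the gpt2 tokenizer maps raw bytes to printable unicode characters before merging, which is what lets multi-byte UTF-8 text round-trip without a llama-style byte-fallback offset.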

@goerch
Collaborator

goerch commented Aug 6, 2023

> Should we revert it, or can we adapt the PR to this change?

After working through some issues with #2487, I have adapted the PR to the best of my knowledge. test-tokenizer-1 is somehow working for Aquila-7B, but I certainly need help here, especially with character classification. An adaptation of test-tokenizer-0 seems most desirable, too.

> This looks like a temporary solution. In GGUF we have support for a real gpt2 tokenizer since it supports adding the merges. In PR #2398 there is a gptneox example with an excellent gpt2 tokenizer supporting both merges and unicode.

Should we continue to test with the main branch or merge the PR into GGUF now?
