Different tokenization than AutoTokenizer when word is adjacent to non-special added token #7049
Comments
I'm able to reproduce your results with a different model.
@Jeximo thanks! Updating issue title to reflect this.
I have also seen this issue: passing HF-tokenized text into llama.cpp gives good perplexity, but llama.cpp-tokenized text gives bad perplexity.
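One way to quantify such a comparison (a sketch, not the setup used above): score the same text under both tokenizations with the same model and compare perplexities. Here `hf_ids` and `llamacpp_ids` are placeholders for token-id lists produced by AutoTokenizer and by llama.cpp for the same text, and the repo id is assumed from the linked quantization.

```python
import math
import torch
from transformers import AutoModelForCausalLM

# Sketch only: repo id assumed; any causal LM works for the method.
model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
model.eval()

def perplexity(ids: list[int]) -> float:
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean NLL over the sequence
    return math.exp(loss.item())

# `hf_ids` / `llamacpp_ids` are placeholders for the two tokenizations.
print("HF tokenization:       ", perplexity(hf_ids))
print("llama.cpp tokenization:", perplexity(llamacpp_ids))
```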
There is logic in llama.cpp, lines 12690 to 12695 at commit 947d3ad, that handles this case. That's why the word is tokenized differently when it directly follows the added token.
Sounds reasonable. Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently? As far as I can tell, it tokenizes the string as [32001, 2188] regardless. I'll just close the issue if it's not a problem that llama.cpp behaves differently than AutoTokenizer.
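For reference, the AutoTokenizer side is easy to check directly (a sketch; the Hub repo id is assumed to be the base model of the linked GGUF):

```python
from transformers import AutoTokenizer

# Repo id assumed from the linked quantization.
tok = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

# Compare the word on its own, adjacent to the added token, and after a space.
print(tok.encode("user", add_special_tokens=False))
print(tok.encode("<|im_start|>user", add_special_tokens=False))
print(tok.encode("<|im_start|> user", add_special_tokens=False))
```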
This is a valid question and one I'm keen to have answered.
It's probably a good idea to stick to Transformers' behaviour, so if there is a suggestion for how to fix our logic, we can merge it. I'm not 100% sure, though; I think this might be one more instance of the added-token handling discussed in #7144.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Original issue description:

llama.cpp commit: 6ecf318 (current master)
llama.cpp example server with https://huggingface.co/RichardErkhov/NousResearch_-_Nous-Hermes-2-Mixtral-8x7B-DPO-gguf/blob/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf
llama.cpp changes the tokenization of the word "user" when it comes directly after the added token <|im_start|>; AutoTokenizer does not.
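A minimal reproduction of the comparison (a sketch, assuming the example server is running locally on its default port and that the matching HF repo id is the base model of the linked GGUF):

```python
import requests
from transformers import AutoTokenizer

text = "<|im_start|>user"

# Tokenize with the running llama.cpp example server.
r = requests.post("http://localhost:8080/tokenize", json={"content": text})
print("llama.cpp:    ", r.json()["tokens"])

# Tokenize with AutoTokenizer for the corresponding HF model (repo id assumed).
tok = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
print("AutoTokenizer:", tok.encode(text, add_special_tokens=False))
```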
Could it be relevant that <|im_start|> is not a special token in this model? (See its tokenizer_config.json.)
A different model where <|im_start|> is a special token does not show this behaviour. (See that model's tokenizer_config.json.)
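One way to check the flag directly is to inspect added_tokens_decoder in a downloaded tokenizer_config.json (a sketch; the local file path is a placeholder):

```python
import json

# Path is a placeholder for a locally downloaded tokenizer_config.json.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)

# added_tokens_decoder maps token ids to metadata, including the "special" flag.
for tok_id, info in cfg.get("added_tokens_decoder", {}).items():
    if info.get("content") == "<|im_start|>":
        print(f"id={tok_id} special={info.get('special')}")
```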