Tokenizer broken for Mixtral models #1009

Open
Azirine opened this issue Jul 17, 2024 · 2 comments

Azirine commented Jul 17, 2024

With Mixtral 8x7B, [INST] and [/INST] are not tokenized correctly.

[Debug: Dump Forwarded Input Tokens, format: 6]
'  (28705)', '\n (13)', '[ (28792)', 'INST (16289)', '] (28793)', ' hi (12014)', ' [ (733)', '/ (28748)', 'INST (16289)', '] (28793)', '\n (13)', 

Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Instruct Tag Preset: Mistral

ggerganov also noted this problem in llama.cpp, although there it only happens for Mixtral 8x22B, not 8x7B.
ggml-org#7969 (comment)

With WizardLM-2-8x22B, this also happens with USER: and ASSISTANT:.

[Debug: Dump Forwarded Input Tokens, format: 6]
'<s> (1)', '  (28705)', '\n (13)', 'USER (11123)', ': (28747)', ' hi (12014)', '\n (13)', 'ASS (4816)', 'IST (8048)', 'ANT (12738)', ': (28747)', '  (28705)', 

Model: https://huggingface.co/alpindale/WizardLM-2-8x22B
Instruct Tag Preset: Vicuna

Format: Instruct Mode
Koboldcpp 1.70.1
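
For reference, both splits can be reproduced outside koboldcpp with the models' own Hugging Face tokenizers. This is a minimal sketch, assuming the `transformers` package is installed and both repos (and their tokenizer files) are reachable; the prompt strings are just examples taken from the dumps above:

```python
# Sketch: dump token ids/pieces for the two prompts above using each model's
# Hugging Face tokenizer (repos and prompts taken from this report).
from transformers import AutoTokenizer

cases = [
    ("mistralai/Mixtral-8x7B-Instruct-v0.1", "[INST] hi [/INST]"),
    ("alpindale/WizardLM-2-8x22B", "USER: hi\nASSISTANT:"),
]

for repo, prompt in cases:
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok.encode(prompt, add_special_tokens=False)
    print(repo)
    print(list(zip(ids, tok.convert_ids_to_tokens(ids))))
    # Expect '[INST]' and 'ASSISTANT' to come back as several sub-word pieces
    # rather than single tokens, matching the dumps above.
```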

Azirine changed the title from "WizardLM-2-8x22B tokenizer fault" to "Tokenizer broken for Mixtral models" on Jul 20, 2024

Azirine commented Jul 20, 2024

I confirmed that this is not caused by llama.cpp b3028, because versions of koboldcpp prior to b3028 already have this issue.

LostRuins (owner) commented

Actually I don't think this is a bug. I'm looking at the Mixtral vocab, and [INST] is not a token.

https://huggingface.co/neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8/raw/main/tokenizer.json
https://huggingface.co/Doctor-Shotgun/Mixtral-8x7B-Instruct-v0.1-LimaRP-ZLoss/raw/main/tokenizer.json

As far as I can see, there are no Mixtral models that use that added token.
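
For anyone who wants to double-check, here is a small sketch (assuming one of the tokenizer.json files linked above has been downloaded into the working directory; the file path is an assumption) that tests whether the chat tags exist as tokens:

```python
# Check whether [INST]/[/INST] appear in the vocab or added_tokens of a
# locally downloaded tokenizer.json (path is an assumption; adjust as needed).
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tj = json.load(f)

vocab = tj["model"]["vocab"]                         # base BPE vocabulary
added = {t["content"] for t in tj["added_tokens"]}   # special/added tokens

for piece in ("[INST]", "[/INST]"):
    print(piece, piece in vocab or piece in added)
# Per the observation above, both should print False for these Mixtral
# tokenizers, so the tags can only be assembled from sub-word pieces.
```

If [INST] were registered as an added token, the tokenizer would emit a single id for it instead of the '[', 'INST', ']' split seen in the dumps.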
