Tokenizer broken for Mixtral models #1009

Open
Azirine opened this issue Jul 17, 2024 · 2 comments

Azirine commented Jul 17, 2024

With Mixtral 8x7B, [INST] and [/INST] are not tokenized correctly.

[Debug: Dump Forwarded Input Tokens, format: 6]
'  (28705)', '\n (13)', '[ (28792)', 'INST (16289)', '] (28793)', ' hi (12014)', ' [ (733)', '/ (28748)', 'INST (16289)', '] (28793)', '\n (13)', 

Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Instruct Tag Preset: Mistral

ggerganov also noted this problem in llama.cpp, although there it only happens for Mixtral 8x22B, not 8x7B.
ggml-org#7969 (comment)

With WizardLM-2-8x22B, this also happens with USER: and ASSISTANT:.

[Debug: Dump Forwarded Input Tokens, format: 6]
'<s> (1)', '  (28705)', '\n (13)', 'USER (11123)', ': (28747)', ' hi (12014)', '\n (13)', 'ASS (4816)', 'IST (8048)', 'ANT (12738)', ': (28747)', '  (28705)', 

Model: https://huggingface.co/alpindale/WizardLM-2-8x22B
Instruct Tag Preset: Vicuna

Format: Instruct Mode
Koboldcpp 1.70.1
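
For reference, both splits can be reproduced outside koboldcpp with the models' own Hugging Face tokenizers. This is a minimal sketch, assuming the `transformers` package is installed and both repos (and their tokenizer files) are reachable; the prompt strings are just examples taken from the dumps above:

```python
# Sketch: dump token ids/pieces for the two prompts above using each model's
# Hugging Face tokenizer (repos and prompts taken from this report).
from transformers import AutoTokenizer

cases = [
    ("mistralai/Mixtral-8x7B-Instruct-v0.1", "[INST] hi [/INST]"),
    ("alpindale/WizardLM-2-8x22B", "USER: hi\nASSISTANT:"),
]

for repo, prompt in cases:
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok.encode(prompt, add_special_tokens=False)
    print(repo)
    print(list(zip(ids, tok.convert_ids_to_tokens(ids))))
    # Expect '[INST]' and 'ASSISTANT' to come back as several sub-word pieces
    # rather than single tokens, matching the dumps above.
```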

Azirine changed the title from "WizardLM-2-8x22B tokenizer fault" to "Tokenizer broken for Mixtral models" on Jul 20, 2024

Azirine commented Jul 20, 2024

I confirmed that this is not caused by llama.cpp b3028, because versions of koboldcpp prior to b3028 already have this issue.

LostRuins (owner) commented

Actually I don't think this is a bug. I'm looking at the Mixtral vocab, and [INST] is not a token.

https://huggingface.co/neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8/raw/main/tokenizer.json
https://huggingface.co/Doctor-Shotgun/Mixtral-8x7B-Instruct-v0.1-LimaRP-ZLoss/raw/main/tokenizer.json

As far as I can see, there are no Mixtral models that use that added token.
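
For anyone who wants to double-check, here is a small sketch (assuming one of the tokenizer.json files linked above has been downloaded into the working directory; the file path is an assumption) that tests whether the chat tags exist as tokens:

```python
# Check whether [INST]/[/INST] appear in the vocab or added_tokens of a
# locally downloaded tokenizer.json (path is an assumption; adjust as needed).
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tj = json.load(f)

vocab = tj["model"]["vocab"]                         # base BPE vocabulary
added = {t["content"] for t in tj["added_tokens"]}   # special/added tokens

for piece in ("[INST]", "[/INST]"):
    print(piece, piece in vocab or piece in added)
# Per the observation above, both should print False for these Mixtral
# tokenizers, so the tags can only be assembled from sub-word pieces.
```

If [INST] were registered as an added token, the tokenizer would emit a single id for it instead of the '[', 'INST', ']' split seen in the dumps.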
