Add test for MPT tokenization #3728
I adapted the changes from this PR locally on my stablelm branch, and they seem to fix the issue from #3604. When passing |
I think this is what I feared: there are Edit: @jploski just showed us a way to distinguish them via 74204cc#commitcomment-130628610. So I'm inclined to merge this PR and improve the handling of the mentioned issue in a separate one. |
I just wanted to let you know that for the stablelm use case it's fine (added_vocab contains only spaces and end of text). Thanks for this! |
To fix this issue, I imagine changing the detokenizer code to output |
Unfortunately, I don't see a way to distinguish CONTROL and USER_DEFINED tokens when using AutoTokenizer.
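For illustration only, here is a rough sketch of how the CONTROL / USER_DEFINED split might be made. It is not necessarily the approach from the commit comment linked above. It assumes each added token carries a boolean `special` flag, as the `AddedToken` objects in recent `transformers` releases appear to (reachable through the tokenizer's `added_tokens_decoder` mapping). The `TokenKind` enum and `classify_added_tokens` helper are hypothetical names, and the stand-in objects below take the place of a real tokenizer so the snippet runs without downloading a model.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Any, Dict, Mapping


class TokenKind(Enum):
    """Hypothetical labels for the two kinds of added tokens discussed here."""
    CONTROL = "control"
    USER_DEFINED = "user_defined"


def classify_added_tokens(added_tokens: Mapping[int, Any]) -> Dict[int, TokenKind]:
    """Label each added token by its `special` flag.

    Assumption: every value exposes a boolean `special` attribute.
    A token without the attribute is treated as USER_DEFINED.
    """
    return {
        token_id: (
            TokenKind.CONTROL
            if getattr(token, "special", False)
            else TokenKind.USER_DEFINED
        )
        for token_id, token in added_tokens.items()
    }


# Stand-ins for added tokens: one end-of-text marker and one run of spaces.
# The ids and contents are made up for the example.
example_added_tokens = {
    0: SimpleNamespace(content="<|endoftext|>", special=True),
    1: SimpleNamespace(content="  ", special=False),
}

kinds = classify_added_tokens(example_added_tokens)
for token_id, kind in kinds.items():
    print(token_id, repr(example_added_tokens[token_id].content), kind.name)
```

With a real tokenizer, one would presumably pass `tokenizer.added_tokens_decoder` instead of the stand-in dictionary, provided the installed `transformers` version exposes it.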