Add test for MPT tokenization #3728

Merged
merged 4 commits into from
Oct 22, 2023

Conversation

goerch
Collaborator

@goerch goerch commented Oct 22, 2023

Unfortunately, I don't see a way to distinguish CONTROL and USER_DEFINED tokens when using AutoTokenizer.

@Galunid
Collaborator

Galunid commented Oct 22, 2023

I adapted the changes from this PR locally on my stablelm branch, and they seem to fix the issue from #3604.

When passing "The best music is  " as a prompt (note the two spaces at the end), the spaces seem to get stripped, though:
"The best music isthe music you have the most fun making."

@goerch
Collaborator Author

goerch commented Oct 22, 2023

> When passing "The best music is  " as a prompt (note the two spaces at the end), the spaces seem to get stripped, though:
> "The best music isthe music you have the most fun making."

I think this is what I feared: there are special and non-special added tokens, which have to be handled before the core tokenizer's encode and decode, and we currently can't distinguish them.

Edit: @jploski just showed us a way to distinguish them in 74204cc#commitcomment-130628610, so I'm inclined to merge this PR and improve the handling of the mentioned issue in another one.
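
To make the CONTROL versus USER_DEFINED distinction concrete, here is a minimal sketch. It is not the approach from the linked commit comment, which is not reproduced here. It assumes that each added token carries a boolean `special` flag; the `AddedToken` dataclass, the `classify_tokens` helper, and the toy vocabulary are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    # Illustrative categories only; the numeric values carry no meaning here.
    NORMAL = auto()
    CONTROL = auto()
    USER_DEFINED = auto()


@dataclass(frozen=True)
class AddedToken:
    # Hypothetical stand-in for an added-token record.
    content: str
    special: bool  # ASSUMPTION: added tokens expose such a flag


def classify_tokens(vocab, added):
    """Map every token id to a TokenType.

    vocab: dict of token id -> token text (base and added tokens together)
    added: dict of token id -> AddedToken (added tokens only)
    """
    types = {}
    for token_id in vocab:
        entry = added.get(token_id)
        if entry is None:
            types[token_id] = TokenType.NORMAL        # part of the core vocabulary
        elif entry.special:
            types[token_id] = TokenType.CONTROL       # special added token
        else:
            types[token_id] = TokenType.USER_DEFINED  # non-special added token
    return types


# Toy data: an end-of-text marker and a run of two spaces as added tokens.
vocab = {0: "<|endoftext|>", 1: "  ", 2: "music"}
added = {
    0: AddedToken("<|endoftext|>", special=True),
    1: AddedToken("  ", special=False),
}
token_types = classify_tokens(vocab, added)
```

With this toy data, the end-of-text marker ends up as CONTROL and the two-space token as USER_DEFINED, which is the split the conversion and detokenization code would need.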

@Galunid
Collaborator

Galunid commented Oct 22, 2023

I just wanted to let you know that for the stablelm use case it's fine (added_vocab contains only spaces and end of text). Thanks for this!

@goerch goerch merged commit 9e70cc0 into ggml-org:master Oct 22, 2023
@goerch
Collaborator Author

goerch commented Oct 23, 2023

> I adapted the changes from this PR locally on my stablelm branch, and they seem to fix the issue from #3604.
>
> When passing "The best music is  " as a prompt (note the two spaces at the end), the spaces seem to get stripped, though:
> "The best music isthe music you have the most fun making."

To fix this issue, I plan to change the detokenizer code to output USER_DEFINED tokens (this should repair the problem) and adapt the test cases to ignore USER_DEFINED tokens where needed. I will propose this once the test cases are done. Does this sound like a plan?
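
The proposed behaviour can be sketched with a toy detokenizer. This is a simplified Python illustration under stated assumptions, not the project's actual detokenizer: the vocabulary, the token ids, the `detokenize` function, and its `render_user_defined` switch are all hypothetical. It shows how skipping USER_DEFINED tokens reproduces the stripped spaces, while emitting them preserves the prompt.

```python
from enum import Enum, auto


class TokenType(Enum):
    # Illustrative categories only.
    NORMAL = auto()
    CONTROL = auto()
    USER_DEFINED = auto()


def detokenize(token_ids, vocab, token_types, render_user_defined=True):
    """Join token texts, always skipping CONTROL tokens.

    USER_DEFINED tokens are emitted only when render_user_defined is True.
    """
    pieces = []
    for token_id in token_ids:
        kind = token_types[token_id]
        if kind is TokenType.CONTROL:
            continue  # control tokens never appear in the output text
        if kind is TokenType.USER_DEFINED and not render_user_defined:
            continue  # the behaviour that loses the two spaces
        pieces.append(vocab[token_id])
    return "".join(pieces)


# Toy vocabulary (hypothetical): id 1 is an added token holding two spaces.
vocab = {0: "<|endoftext|>", 1: "  ", 2: "The best music is", 3: "the music"}
token_types = {
    0: TokenType.CONTROL,
    1: TokenType.USER_DEFINED,
    2: TokenType.NORMAL,
    3: TokenType.NORMAL,
}
token_ids = [2, 1, 3, 0]

stripped = detokenize(token_ids, vocab, token_types, render_user_defined=False)
preserved = detokenize(token_ids, vocab, token_types, render_user_defined=True)
```

Here `stripped` is "The best music isthe music" and `preserved` is "The best music is  the music". Test cases that compare against a reference tokenizer could pass `render_user_defined=False` where USER_DEFINED tokens should be ignored.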

@goerch goerch deleted the test-#3604 branch October 24, 2023 11:04

5 participants