Implement a new Decoder to support Byte->Char hack spm conversion. #928

Closed
Narsil opened this issue Mar 1, 2022 · 1 comment


@Narsil
Collaborator

Narsil commented Mar 1, 2022

Has to go after: #872

And implement this algorithm:

https://github.com/huggingface/tokenizers/pull/896/files#diff-0dce407dfb4cba577e8da3aecb16d5e52d0e35faa1df70d1845c69790ca1651cR81

Basically, when decoding, if a token "looks" like a byte (e.g. <0x16>), treat it as that byte.
This trick is also used in SPM: it makes it possible to reduce the vocabulary size by not including every Unicode code point in the tokenizer, without actually reworking everything to iterate over pure bytes.
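
A minimal Python sketch of the idea, for illustration only. It is not the code from the linked PR: the function name `decode_with_byte_fallback`, the `BYTE_TOKEN` pattern, and the choice to replace invalid UTF-8 with U+FFFD are all assumptions made here. Consecutive byte-looking tokens are buffered and decoded together as UTF-8, and every other token is passed through unchanged.

```python
import re

# Assumed shape of a byte-looking token: "<0x" + two hex digits + ">".
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def decode_with_byte_fallback(tokens):
    """Join tokens into a string, treating byte-looking tokens as raw bytes."""
    out = []
    pending = bytearray()  # bytes collected from consecutive byte tokens

    def flush():
        # Decode buffered bytes as UTF-8. Replacing invalid sequences with
        # U+FFFD is an assumption of this sketch, not a documented behavior.
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(token)
    flush()
    return "".join(out)
```

For example, the three tokens `<0xE2>`, `<0x82>`, `<0xAC>` are the UTF-8 encoding of the euro sign, so `decode_with_byte_fallback(["Hello", "<0xE2>", "<0x82>", "<0xAC>", "!"])` yields `"Hello€!"` even though the vocabulary never needs a dedicated token for that character.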

@Narsil
Collaborator Author

Narsil commented Mar 7, 2022

Closing this, as the encoding part is not being done, and ByteLevel already implements it.

@Narsil Narsil closed this as completed Mar 7, 2022