Basically, when decoding a token that "looks" like a byte, e.g. `<0x16>`, treat it as a raw byte.
This trick is used in SentencePiece (SPM) too. It makes it possible to shrink the vocabulary by not including every Unicode codepoint in the tokenizer, without actually reworking everything to iterate on pure bytes.
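Something like this for spotting those tokens (a rough sketch; `parse_byte_token` is a hypothetical helper, not the actual tokenizers API):

```rust
/// Returns Some(byte) if the token looks like a byte fallback token
/// such as "<0x16>", None otherwise.
fn parse_byte_token(token: &str) -> Option<u8> {
    // Byte tokens have the exact shape "<0xNN>" with two hex digits.
    let hex = token.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}
```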
Has to go after: #872
And implement this algorithm:
https://github.com/huggingface/tokenizers/pull/896/files#diff-0dce407dfb4cba577e8da3aecb16d5e52d0e35faa1df70d1845c69790ca1651cR81
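A minimal sketch of what that decode step could look like, assuming consecutive byte tokens get buffered and flushed as UTF-8 (the grouping details in the linked PR may differ); it reuses the hypothetical `parse_byte_token` from above:

```rust
/// Sketch: buffer runs of byte tokens and flush them as (lossy) UTF-8;
/// pass every other token through unchanged.
fn decode(tokens: &[&str]) -> String {
    let mut out = String::new();
    let mut bytes: Vec<u8> = Vec::new();
    for token in tokens {
        if let Some(b) = parse_byte_token(token) {
            // Accumulate raw bytes until a normal token shows up.
            bytes.push(b);
        } else {
            if !bytes.is_empty() {
                // Decode the pending byte run; invalid sequences become U+FFFD.
                out.push_str(&String::from_utf8_lossy(&bytes));
                bytes.clear();
            }
            out.push_str(token);
        }
    }
    if !bytes.is_empty() {
        out.push_str(&String::from_utf8_lossy(&bytes));
    }
    out
}
```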