Implement a new Decoder to support Byte->Char hack spm conversion. #928

Narsil · 2022-03-01T09:53:13Z

Has to go after: #872

And implement this algorithm:

https://github.com/huggingface/tokenizers/pull/896/files#diff-0dce407dfb4cba577e8da3aecb16d5e52d0e35faa1df70d1845c69790ca1651cR81

Basically, when decoding, a token that "looks" like a byte (e.g. <0x16>) should be treated as that raw byte.
This trick is used in SPM too. It makes it possible to reduce the vocabulary size by not including every Unicode codepoint in the tokenizer, without reworking everything to iterate on pure bytes.
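
To make the idea concrete, here is a minimal Python sketch of such a decoding step. It is not the code from the linked PR. The function name, the regular expression, and the choice to replace invalid byte sequences with U+FFFD are all assumptions made for illustration, and SPM-specific details such as the whitespace marker are left out.

```python
import re

# Assumed token shape for byte-fallback tokens: "<0x" + two hex digits + ">".
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def decode_with_byte_fallback(tokens):
    """Join tokens into a string, treating runs of <0xNN> tokens as UTF-8 bytes.

    Hypothetical helper for illustration only; not part of the tokenizers API.
    """
    pieces = []
    pending = bytearray()  # raw bytes collected from consecutive byte tokens

    def flush():
        # Decode the buffered bytes together, since one character may span
        # several byte tokens. Invalid sequences become U+FFFD (an assumption).
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            pieces.append(token)
    flush()
    return "".join(pieces)
```

For example, the three tokens <0xE2>, <0x82>, <0xAC> decode together to the single character "€", so that character does not need its own vocabulary entry. Buffering consecutive byte tokens before decoding is the design choice that matters here: decoding each byte token on its own would fail for any multi-byte character.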

Narsil · 2022-03-07T09:11:07Z

Closing this, as the encoding part is not being done, and ByteLevel already implements it.

Narsil added this to the 0.12 (Bigscience enabled, clean modifications) milestone Mar 1, 2022

Narsil mentioned this issue Mar 1, 2022

Some questions about building a tokenizer from scratch: vocab size can't decide actual vocab size and token order unstable. #903

Closed

Narsil closed this as completed Mar 7, 2022
