Optionally remove the REGEXP from `ByteLevel` pre_tokenizer #931

Narsil · 2022-03-01T10:03:50Z

This is partly done in #896:

https://github.com/huggingface/tokenizers/pull/896/files#diff-c2be5839194fbc7e052d4600684eeb35f0db504b5b4df6565f01057762e23774R46

A few changes might improve it, such as using `None` instead of `"no_regexp"` in Python.
In Rust, the regexp is hardcoded, and we definitely don't want to let users choose it: a custom regexp should be done as another pre_tokenizer. So going with a simple Enum instead of Option seems more correct as a first approach.
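To make the request concrete, here is a minimal Python sketch of a byte-level pre-tokenization step where the regexp split can be turned off. It is not the library's implementation: the function name `byte_level_pre_tokenize`, the flag name `use_regex`, and the simplified pattern `SIMPLE_SPLIT` are all assumptions for illustration. The real GPT-2 pattern uses Unicode classes that the stdlib `re` module does not support, and the real pre_tokenizer also remaps bytes to printable characters, which is skipped here.

```python
import re

# Hypothetical, simplified stand-in for the hardcoded GPT-2 split pattern.
# ASCII-only on purpose: stdlib `re` has no \p{L} / \p{N} classes.
SIMPLE_SPLIT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def byte_level_pre_tokenize(text, use_regex=True):
    """Split `text` into pieces and return each piece as UTF-8 bytes.

    use_regex=True  -> split with the hardcoded pattern (the current behavior).
    use_regex=False -> skip the split and keep the input as a single piece.
    """
    pieces = SIMPLE_SPLIT.findall(text) if use_regex else [text]
    return [piece.encode("utf-8") for piece in pieces]


with_split = byte_level_pre_tokenize("Hello world", use_regex=True)
without_split = byte_level_pre_tokenize("Hello world", use_regex=False)
```

With the flag on, the input is cut at the word boundary and the leading space stays attached to the second piece; with it off, the whole string goes through as one piece. A boolean-style switch mirrors the "simple Enum" idea: the pattern itself stays fixed, and anyone needing a custom regexp would chain a separate pre_tokenizer.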


Narsil added this to the 0.12 (Bigscience enabled, clean modifications) milestone Mar 1, 2022

Narsil mentioned this issue Mar 5, 2022

Making the regex in ByteLevel optional. #939

Merged

Narsil closed this as completed in #939 Mar 18, 2022
