Add a `Sequence` object to the decoders #872
Comments
Same as #873. `def decode(tokens: List[str]) -> str` is destructive, so we need to make it non-destructive first. For this one, though, the path seems simpler: we would need to check every single decoder, but it looks like each could become `def decode(tokens: List[str]) -> List[str]`, with the final pass always being the same (`"".join(tokens)`). If that assumption holds, I guess it's very doable.
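A minimal sketch of that refactor, for illustration only: the class and method names below (`ChainableDecoder`, `decode_chain`, `StripEndOfWordSuffix`) are hypothetical, not the library's actual API. Each decoder maps `List[str] -> List[str]`, and the destructive `"".join` happens exactly once at the end.

```python
from typing import List


class ChainableDecoder:
    def decode_chain(self, tokens: List[str]) -> List[str]:
        # Non-destructive: return a transformed token list instead of a str.
        raise NotImplementedError


class StripEndOfWordSuffix(ChainableDecoder):
    """Toy decoder that replaces a BPE end-of-word suffix with a space."""

    def __init__(self, suffix: str = "</w>"):
        self.suffix = suffix

    def decode_chain(self, tokens: List[str]) -> List[str]:
        return [
            t[: -len(self.suffix)] + " " if t.endswith(self.suffix) else t
            for t in tokens
        ]


def decode(decoders: List[ChainableDecoder], tokens: List[str]) -> str:
    # Chain the decoders, then perform the single shared final pass.
    for decoder in decoders:
        tokens = decoder.decode_chain(tokens)
    return "".join(tokens)


print(decode([StripEndOfWordSuffix()], ["Hello</w>", "wor", "ld</w>"]))
# -> "Hello world " (note the trailing space left by the suffix replacement)
```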
This is great news! Your diagnosis makes a lot of sense (I'm not super familiar with the code base, but I imagine tests will confirm it).
To transcribe a design that was discussed offline a long time ago: it was proposed to implement 2 methods for each decoder: 1) keep the current …
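The comment above is truncated, but one plausible reading of that two-method design, consistent with the previous comment, is sketched below. The method name `decode_chain` and the default implementation are assumptions, not the transcribed design itself: keep the current `decode(tokens) -> str` for backward compatibility and derive it from a new chainable method.

```python
from typing import List


class Decoder:
    def decode_chain(self, tokens: List[str]) -> List[str]:
        """Hypothetical non-destructive method each decoder would implement."""
        raise NotImplementedError

    def decode(self, tokens: List[str]) -> str:
        """Current public API, kept intact and derived from decode_chain."""
        return "".join(self.decode_chain(tokens))
```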
Reopening due to ec43947
I wonder if it would be useful to have a `Sequence` object for the decoders too.

It seems to me, for example, that if we build a tokenizer with a BPE model that defines an `end_of_word_suffix`, we will need the `BPEDecoder` decoder to replace the `end_of_word_suffix`; and if we also used a `ByteLevel` pre-tokenization, we will need the `ByteLevel` decoder to realign the codes. At the moment, it seems to me that we don't have a solution to choose a suitable decoder for such a tokenizer.

What do you think? 😄
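For illustration, here is what usage of such a `decoders.Sequence` could look like, by analogy with the existing `normalizers.Sequence` and `pre_tokenizers.Sequence`. This API does not exist at the time of this issue; it is the feature being requested, not the library's current behavior.

```python
# Hypothetical: decoders.Sequence is assumed here, mirroring
# normalizers.Sequence / pre_tokenizers.Sequence.
from tokenizers import decoders

decoder = decoders.Sequence([
    decoders.BPEDecoder(suffix="</w>"),  # undo the BPE end_of_word_suffix
    decoders.ByteLevel(),                # realign the byte-level codes
])
```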