Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an Sequence object to the decoders #872

Closed
SaulLu opened this issue Jan 6, 2022 · 4 comments · Fixed by #938
Closed

Add an Sequence object to the decoders #872

SaulLu opened this issue Jan 6, 2022 · 4 comments · Fixed by #938
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@SaulLu
Copy link
Contributor

SaulLu commented Jan 6, 2022

I wonder if it would be useful to have a sequence object for the decoders too.

It seems to me for example that if we build a tokenizer with a BPE model that defines a end_of_word_suffix, we will need to use the BPEDecoder decoder to replace theend_of_word_suffix and if we also used a ByteLevel pre-tokenization we will need the ByteLevel decoder to realign the codes.

At the moment, it seems to me that we don't have a solution to choose a suitable decoder for such a tokenizer.

What do you think? 😄

@SaulLu SaulLu added the enhancement New feature or request label Jan 6, 2022
@Narsil
Copy link
Collaborator

Narsil commented Jan 7, 2022

Same as #873.
I am very favorable in principle.

def decode(tokens: List[str]) -> str

Is destructive, so we need to make it non destructive first.

It seems however that for this one the path seems simpler, we would need to check every single one but it seems decoders could be changed into

def decode(tokens: List[str]) -> List[str]

and make the final pass that is always the same ("".join(tokens)).

If that assumption holds, I guess it's very doable.

@SaulLu
Copy link
Contributor Author

SaulLu commented Jan 11, 2022

This is great news!

Your diagnosis makes a lot of sense (after I'm not super familiar with the code base, but I imagine that with tests we will confirm this diagnosis).

@Narsil Narsil self-assigned this Jan 11, 2022
@Narsil Narsil added the good first issue Good for newcomers label Mar 1, 2022
@Narsil Narsil added this to the 0.12 milestone Mar 1, 2022
Narsil added a commit that referenced this issue Mar 4, 2022
Narsil added a commit that referenced this issue Mar 17, 2022
* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.
@SaulLu
Copy link
Contributor Author

SaulLu commented May 27, 2022

To transcribe a design that had been discussed offline a long time ago.

It was discussed to implement 2 methods for each decoder: 1) keep the current decode method (to be backward compatible) and 2) implement a new method e.g. decode_for_chain which returns a list. This way the decode method of decoders.Sequence will apply all decode_for_chain methods in chain of the listed decoders and a final chaining.

@SaulLu SaulLu reopened this May 27, 2022
@SaulLu
Copy link
Contributor Author

SaulLu commented May 27, 2022

Reopening due to ec43947

Narsil added a commit that referenced this issue Jun 1, 2022
* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.
@Narsil Narsil closed this as completed in 943b542 Jun 2, 2022
Narsil added a commit that referenced this issue Aug 23, 2022
* Changing `Decoder` trait to be more composable. (#938)

* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.

* Adding `Sequence` Decoder.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants