Add a `Sequence` object to the decoders #872

SaulLu · 2022-01-06T16:47:48Z

I wonder if it would be useful to have a sequence object for the decoders too.

It seems to me, for example, that if we build a tokenizer with a BPE model that defines an end_of_word_suffix, we will need the BPEDecoder decoder to replace the end_of_word_suffix, and if we also use a ByteLevel pre-tokenization, we will need the ByteLevel decoder to realign the codes.

At the moment, it seems to me that we don't have a solution to choose a suitable decoder for such a tokenizer.

What do you think? 😄

Narsil · 2022-01-07T08:56:15Z

Same as #873.
I am very much in favor in principle.

def decode(tokens: List[str]) -> str

is destructive (it collapses the list of tokens into a single string), so we need to make it non-destructive first.

For this one, however, the path seems simpler. We would need to check every single decoder, but it seems they could all be changed to

def decode(tokens: List[str]) -> List[str]

with a final pass that is always the same (`"".join(tokens)`).

If that assumption holds, I guess it's very doable.
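To make the proposed signature change concrete, here is a minimal Python sketch. It is not the library's code: `fix_codes`, `strip_suffix`, and `decode_all` are hypothetical names invented for illustration, and the two toy steps only mimic the kind of work the ByteLevel and BPEDecoder decoders do. The point is that once every step maps a list of tokens to a list of tokens, the steps can be chained freely and the only shared final pass is `"".join(tokens)`.

```python
from typing import Callable, Dict, List

# Hypothetical alias: a composable decoder step maps tokens to tokens.
DecoderStep = Callable[[List[str]], List[str]]


def fix_codes(table: Dict[str, str]) -> DecoderStep:
    """Toy stand-in for a byte-level-style decoder: rewrite code sequences per token."""
    def step(tokens: List[str]) -> List[str]:
        out = []
        for token in tokens:
            for code, char in table.items():
                token = token.replace(code, char)
            out.append(token)
        return out
    return step


def strip_suffix(suffix: str) -> DecoderStep:
    """Toy stand-in for a BPE-style decoder: the end-of-word suffix becomes a
    space, except on the last token where it is simply dropped."""
    def step(tokens: List[str]) -> List[str]:
        last = len(tokens) - 1
        return [
            token.replace(suffix, "" if i == last else " ")
            for i, token in enumerate(tokens)
        ]
    return step


def decode_all(tokens: List[str], steps: List[DecoderStep]) -> str:
    """Run every step in order, then apply the one shared final pass."""
    for step in steps:
        tokens = step(tokens)
    return "".join(tokens)


steps = [fix_codes({"Ã©": "é"}), strip_suffix("</w>")]
text = decode_all(["caf", "Ã©</w>", "au</w>", "lait</w>"], steps)
print(text)  # café au lait
```

Because no step collapses the list into a string, neither toy decoder needs to know about the other.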

SaulLu · 2022-01-11T09:20:16Z

This is great news!

Your diagnosis makes a lot of sense (although I'm not very familiar with the code base, I imagine that tests will confirm it).

Fix #872

* Changing `Decoder` trait to be more composable. Fix #872
* Fixing Python side.
* Fixing test.
* Updating cleanup signature, removing turbofish.

SaulLu · 2022-05-27T10:46:47Z

Transcribing a design that was discussed offline a long time ago:

The idea was to implement two methods on each decoder: 1) keep the current decode method (for backward compatibility) and 2) add a new method, e.g. decode_for_chain, which returns a list. This way, the decode method of decoders.Sequence would apply the decode_for_chain methods of the listed decoders in order, followed by a final join.
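The two-method design can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `ChainableDecoder`, `SuffixDecoder`, and `CharMapDecoder` are hypothetical classes, `decode_for_chain` is the example name used above, and the `Sequence` class here is a toy, not the library's `decoders.Sequence`.

```python
from typing import Dict, Iterable, List


class ChainableDecoder:
    """Hypothetical base class: subclasses implement only decode_for_chain."""

    def decode_for_chain(self, tokens: List[str]) -> List[str]:
        raise NotImplementedError

    def decode(self, tokens: List[str]) -> str:
        # Backward-compatible entry point: the chain step plus the final join.
        return "".join(self.decode_for_chain(tokens))


class SuffixDecoder(ChainableDecoder):
    """Toy decoder: the end-of-word suffix becomes a space, dropped on the last token."""

    def __init__(self, suffix: str) -> None:
        self.suffix = suffix

    def decode_for_chain(self, tokens: List[str]) -> List[str]:
        last = len(tokens) - 1
        return [
            token.replace(self.suffix, "" if i == last else " ")
            for i, token in enumerate(tokens)
        ]


class CharMapDecoder(ChainableDecoder):
    """Toy decoder: rewrite code sequences inside each token using a lookup table."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = dict(table)

    def decode_for_chain(self, tokens: List[str]) -> List[str]:
        out = []
        for token in tokens:
            for code, char in self.table.items():
                token = token.replace(code, char)
            out.append(token)
        return out


class Sequence(ChainableDecoder):
    """Toy sequence: feed the token list through each decoder in order."""

    def __init__(self, decoders: Iterable[ChainableDecoder]) -> None:
        self.decoders = list(decoders)

    def decode_for_chain(self, tokens: List[str]) -> List[str]:
        for decoder in self.decoders:
            tokens = decoder.decode_for_chain(tokens)
        return tokens


chain = Sequence([CharMapDecoder({"Ã¯": "ï"}), SuffixDecoder("</w>")])
print(chain.decode(["na", "Ã¯ve</w>", "idea</w>"]))  # naïve idea
```

Each decoder still exposes `decode` and can be used on its own, while the toy `Sequence` inherits the same `decode` and therefore performs the join exactly once, after the last step.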

SaulLu · 2022-05-27T10:48:29Z

Reopening due to ec43947

* Changing `Decoder` trait to be more composable. (#938)
* Changing `Decoder` trait to be more composable. Fix #872
* Fixing Python side.
* Fixing test.
* Updating cleanup signature, removing turbofish.
* Adding `Sequence` Decoder.

SaulLu added the enhancement New feature or request label Jan 6, 2022

SaulLu mentioned this issue Jan 7, 2022

fix CLIP fast tokenizer and change some properties of the slow version huggingface/transformers#15067

Merged

5 tasks

Narsil self-assigned this Jan 11, 2022

Narsil added the good first issue Good for newcomers label Mar 1, 2022

Narsil added this to the 0.12 milestone Mar 1, 2022

Narsil mentioned this issue Mar 1, 2022

Implement a new Decoder to support Byte->Char hack spm conversion. #928

Closed

Narsil added a commit that referenced this issue Mar 4, 2022

Changing Decoder trait to be more composable.

bf85be7

Fix #872

This was referenced Mar 4, 2022

Changing Decoder trait to be more composable. #938

Merged

Adding Sequence Decoder. #940

Merged

Narsil closed this as completed in #938 Mar 17, 2022

Narsil added a commit that referenced this issue Mar 17, 2022

Changing Decoder trait to be more composable. (#938)

cdabef1

SaulLu reopened this May 27, 2022

Narsil added a commit that referenced this issue Jun 1, 2022

Changing Decoder trait to be more composable. (#938)

04895ab

Narsil mentioned this issue Jun 1, 2022

Changing Decoder trait to be more composable. (#938) #1008

Merged

Narsil closed this as completed in 943b542 Jun 2, 2022
