Feature request
I would like to propose adding a `skip_special_tokens` parameter to the `.encode()` method in Transformers. Currently, in order to achieve this behavior, I have to either create two different tokenizers or use a workaround such as inserting a character in the middle of a special token and then removing it to simulate the desired behavior.
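For illustration, the current behavior versus the proposed call could look roughly like this (the `skip_special_tokens` argument to `encode()` does not exist yet; `bert-base-uncased` is just an example checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Today, a special-token string typed by the user is encoded as the special token
# itself ("[SEP]" becomes id 102), not as ordinary text.
print(tok.encode("hello [SEP] world"))

# Proposed (hypothetical) API: treat "[SEP]" as plain user text instead.
# print(tok.encode("hello [SEP] world", skip_special_tokens=True))
```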
Motivation
The motivation for this feature request is that in real-world scenarios, users may enter any kind of textual data, including the special tokens used by the tokenizer. If the tokenizer tokenized such input as is, the injected special tokens would confuse the model and degrade the quality of the product. The `skip_special_tokens` parameter is essential for correctly processing user inputs, not just for the `decode()` method but also for the `encode()` and `__call__()` methods.
Your contribution
I have implemented my own tokenizer that inherits from the Transformers tokenizer and simulates this behavior by removing the special tokens from the vocab before encoding. However, I don't believe this approach would scale well, as it causes a lot of memory allocations and deallocations.
To address this, I suggest keeping two separate dictionaries, one for the special tokens and one for the vocabulary, and adding an if-statement that checks the `skip_special_tokens` parameter. This would keep the implementation efficient.
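Very roughly, the idea could be sketched like this (a toy example with made-up ids, not the actual Transformers internals):

```python
# Toy sketch: special tokens and the regular vocabulary live in separate dicts,
# and the special-token dict is only consulted when skip_special_tokens is False.
SPECIAL_VOCAB = {"[CLS]": 101, "[SEP]": 102, "[PAD]": 0, "[MASK]": 103}
REGULAR_VOCAB = {"hello": 1, "world": 2, "[": 3, "sep": 4, "]": 5}
UNK_ID = 100

def token_to_id(token: str, skip_special_tokens: bool = False) -> int:
    if not skip_special_tokens and token in SPECIAL_VOCAB:
        return SPECIAL_VOCAB[token]           # recognized as a special token
    return REGULAR_VOCAB.get(token, UNK_ID)   # otherwise looked up as plain text
```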
Thank you for considering this feature request.
Hey, we have to consider whether we want to maintain this and add it as functionality to ALL tokenizers.
If you actually want to skip the special tokens, then a simple way to do this in the slow tokenizer is to modify the `tokenize` function like the following:
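For example, a minimal sketch along these lines (assuming a slow tokenizer subclass; the class name and the `skip_special_tokens` flag are illustrative):

```python
from transformers import BertTokenizer  # any slow (pure-Python) tokenizer works the same way

class SkipSpecialTokensTokenizer(BertTokenizer):  # hypothetical subclass
    def tokenize(self, text, skip_special_tokens=False, **kwargs):
        if skip_special_tokens:
            # Drop every special-token string from the raw text before tokenizing,
            # so "[SEP]" and friends never get matched as added tokens.
            for special in self.all_special_tokens:
                text = text.replace(special, "")
        return super().tokenize(text, **kwargs)

tok = SkipSpecialTokensTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("hello [SEP] world", skip_special_tokens=True))  # ['hello', 'world']
```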
This could be added, as it is general enough (though it might not get much use), and it would require modifications to the base class.
However, if you are looking for something more like a fallback where the special tokens are not split, I don't really see the need to remove the token from the vocabulary. You only have to redefine the `convert_tokens_to_ids` function. Here is a snippet:
```python
from typing import List, Union

def convert_tokens_to_ids(self, tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    ids = []
    for token in tokens:
        if token in self.all_special_tokens:  # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
            for tok in token.split():  # post-process the way you want. Split the string?
                ids.append(self._convert_token_to_id_with_added_voc(tok))
        else:
            ids.append(self._convert_token_to_id_with_added_voc(token))
    return ids
```
This is something pretty specific, and I don't see a reason to include it in transformers.