
Adding a skip_special_tokens Parameter to .encode() in Transformers #22490

Closed
Beyondo opened this issue Mar 31, 2023 · 2 comments
Labels
Feature request (Request for a new feature)

Comments

Beyondo commented Mar 31, 2023

Feature request

I would like to propose adding a skip_special_tokens parameter to the .encode() method in Transformers. Currently, to get this behavior, I have to either maintain two different tokenizers or use a workaround such as inserting a character in the middle of a special token and then removing it afterwards, to simulate the desired behavior.
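For illustration, here is a minimal sketch of the behavior in question, assuming a GPT-2 tokenizer with its default handling of special tokens; the commented-out call at the end only shows what the proposed parameter could look like and does not exist today:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# User-supplied text that happens to contain a special-token string
user_input = "please ignore <|endoftext|> and keep going"

ids = tok.encode(user_input)
print(tok.eos_token_id in ids)  # True today: the in-text "<|endoftext|>" is encoded as the real EOS token

# Proposed (hypothetical) call once such a parameter exists:
# ids = tok.encode(user_input, skip_special_tokens=True)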

Motivation

The motivation for this feature request is that in real-world scenarios, users may enter any kind of textual data, including the special tokens used by the tokenizer. If the tokenizer tokenized such input as is, it would confuse the model and degrade the quality of the product. The skip_special_tokens parameter is essential for correctly processing user inputs, not just in the decode() method but also in encode() and __call__().

Your contribution

I have implemented my own tokenizer that inherits from the Transformers tokenizer and simulates this behavior by removing the special tokens from the vocab before encoding. However, I believe this approach would not scale well, as it causes a lot of memory allocations and deallocations.

To address this, I suggest keeping two separate dictionaries, one for special tokens and one for the regular vocabulary, and adding an if-statement that tests the skip_special_tokens parameter. This would keep the implementation efficient.
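As a rough illustration of that idea, here is a toy, word-level sketch with made-up names (not actual Transformers internals): the regular vocabulary and the special tokens live in separate dicts, so nothing has to be added to or removed from the vocab at encode time, and a single if-statement handles the flag.

from typing import Dict, List

def toy_encode(
    words: List[str],
    vocab: Dict[str, int],            # regular vocabulary
    special_vocab: Dict[str, int],    # special tokens only, e.g. {"[SEP]": 102}
    unk_id: int,
    skip_special_tokens: bool = False,
) -> List[int]:
    ids: List[int] = []
    for word in words:
        if word in special_vocab and not skip_special_tokens:
            ids.append(special_vocab[word])       # treat as a real special token
        else:
            # either fall back to the regular vocab (as here) or drop the word,
            # depending on which "skip" semantics the library settles on
            ids.append(vocab.get(word, unk_id))
    return ids

# toy_encode(["hello", "[SEP]", "world"], {"hello": 1, "world": 2}, {"[SEP]": 102}, unk_id=0, skip_special_tokens=True)
# -> [1, 0, 2]   ("[SEP]" falls back to the regular vocab and becomes UNK)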

Thank you for considering this feature request.

sgugger (Collaborator) commented Mar 31, 2023

cc @ArthurZucker

ArthurZucker added the Feature request label Apr 24, 2023
ArthurZucker (Collaborator) commented:
Hey, we have to consider whether or not we want to maintain this and add it as functionality to ALL tokenizers.
If you actually want to skip the special tokens, a simple way to do this in the slow tokenizer is to modify the tokenize function along these lines:

    def tokenize(self, text: TextInput, **kwargs) -> List[str]:
        # ... existing logic that builds `tokens` and `no_split_token` ...
        skip_special_tokens = kwargs.pop("skip_special_tokens", False)
        for i, token in enumerate(tokens):
            if token in no_split_token:
                # ... existing lookup of `tok_extended` for this token ...
                if isinstance(tok_extended, AddedToken):
                    if skip_special_tokens:
                        tokens[i] = None  # drop the special token (later code must skip None entries)
                    else:
                        # ... existing AddedToken handling ...
        return tokenized_text

This could be added, as it is general enough (though it might not see a lot of usage) and only requires modifications to the base class.
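If a change along those lines were adopted, calling the slow tokenizer directly could look roughly like this (hypothetical; the model name and kwarg usage are assumptions, not current behavior):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)  # slow tokenizer

# Hypothetical call once the sketch above is in place:
tokens = tok.tokenize("hello [SEP] world", skip_special_tokens=True)
# "[SEP]" would no longer be kept as an atomic special token in `tokens`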

However, if you are looking for something more like a fallback where the special tokens are not split, I don't really see the need to remove the token from the vocabulary. You just have to redefine the convert_tokens_to_ids function. Here is a snippet:

def convert_tokens_to_ids(self, tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    ids = []
    for token in tokens:
        if token in self.all_special_tokens:  # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
            # post-process the way you want, e.g. split the string
            for tok in token.split():
                ids.append(self._convert_token_to_id_with_added_voc(tok))
        else:
            ids.append(self._convert_token_to_id_with_added_voc(token))
    return ids

This is something pretty specific, and I don't see a reason to include it in transformers.
