Feature request
I would like to propose adding a `skip_special_tokens` parameter to the `.encode()` method in Transformers. Currently, in order to achieve this behavior, I have to either create two different tokenizers or use a workaround such as inserting a character in the middle of a special token and then removing it to simulate the desired behavior.
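For illustration, the current behavior versus the proposed call could look roughly like this (the `skip_special_tokens` argument to `encode()` does not exist yet; `bert-base-uncased` is just an example checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Today, a special-token string typed by the user is encoded as the special token
# itself ("[SEP]" becomes id 102), not as ordinary text.
print(tok.encode("hello [SEP] world"))

# Proposed (hypothetical) API: treat "[SEP]" as plain user text instead.
# print(tok.encode("hello [SEP] world", skip_special_tokens=True))
```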
Motivation
The motivation for this feature request is that in real-world scenarios, users may enter any kind of textual data, including the special tokens used by the tokenizer. If the tokenizer tokenized such input as is, the injected special tokens would confuse the model and degrade the quality of the product. The `skip_special_tokens` parameter is essential for correctly processing user inputs, not just for the `decode()` method but also for the `encode()` and `__call__()` methods.
Your contribution
I have implemented my own tokenizer that inherits from the Transformers tokenizer and simulates this behavior by removing the special tokens from the vocab before encoding. However, I don't believe this approach would scale well, as it causes a lot of memory allocations and deallocations.
To address this, I suggest keeping two separate dictionaries, one for the special tokens and one for the vocabulary, and adding an if-statement that checks the `skip_special_tokens` parameter. This would keep the implementation efficient.
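Very roughly, the idea could be sketched like this (a toy example with made-up ids, not the actual Transformers internals):

```python
# Toy sketch: special tokens and the regular vocabulary live in separate dicts,
# and the special-token dict is only consulted when skip_special_tokens is False.
SPECIAL_VOCAB = {"[CLS]": 101, "[SEP]": 102, "[PAD]": 0, "[MASK]": 103}
REGULAR_VOCAB = {"hello": 1, "world": 2, "[": 3, "sep": 4, "]": 5}
UNK_ID = 100

def token_to_id(token: str, skip_special_tokens: bool = False) -> int:
    if not skip_special_tokens and token in SPECIAL_VOCAB:
        return SPECIAL_VOCAB[token]           # recognized as a special token
    return REGULAR_VOCAB.get(token, UNK_ID)   # otherwise looked up as plain text
```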
Thank you for considering this feature request.
Hey, we have to consider whether we want to maintain this and add it as functionality to ALL tokenizers.
If you actually want to skip the special tokens, then a simple way to do this in the slow tokenizer is to modify the `tokenize` function like the following:
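For example, a minimal sketch along these lines (assuming a slow tokenizer subclass; the class name and the `skip_special_tokens` flag are illustrative):

```python
from transformers import BertTokenizer  # any slow (pure-Python) tokenizer works the same way

class SkipSpecialTokensTokenizer(BertTokenizer):  # hypothetical subclass
    def tokenize(self, text, skip_special_tokens=False, **kwargs):
        if skip_special_tokens:
            # Drop every special-token string from the raw text before tokenizing,
            # so "[SEP]" and friends never get matched as added tokens.
            for special in self.all_special_tokens:
                text = text.replace(special, "")
        return super().tokenize(text, **kwargs)

tok = SkipSpecialTokensTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("hello [SEP] world", skip_special_tokens=True))  # ['hello', 'world']
```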
This could be added, as it is general enough (though it might not get much use), and it would require modifications to the base class.
However, if you are looking for something more like a fallback where the special tokens are not split, I don't really see the need to remove the token from the vocabulary. You only have to redefine the `convert_tokens_to_ids` function. Here is a snippet:
```python
from typing import List, Union

def convert_tokens_to_ids(self, tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    ids = []
    for token in tokens:
        if token in self.all_special_tokens:  # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
            for tok in token.split():  # post-process the way you want. Split the string?
                ids.append(self._convert_token_to_id_with_added_voc(tok))
        else:
            ids.append(self._convert_token_to_id_with_added_voc(token))
    return ids
```
This is something pretty specific, and I don't see a reason to include it in transformers.