[split_special_tokens] Add support for split_special_tokens argument to encode #25081

Merged: 26 commits, Aug 18, 2023

ArthurZucker
Collaborator

What does this PR do?

The argument name is open to debate. This will also require a pull request in tokenizers.
The goal is to make it simple to enable and disable the splitting of special tokens. The feature was requested in #22490 and is needed for some production use cases, where users supply the input and we don't want them to be able to exploit special tokens in it.
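
To make the intended behaviour concrete, here is a minimal, self-contained sketch. It is a toy illustration, not the code in this PR: `toy_tokenize` and `SPECIAL_TOKENS` are hypothetical names, and the word/punctuation splitting only stands in for a real tokenization algorithm. The assumption is that, with splitting enabled, a special-token string in the input is treated as ordinary text instead of being kept as one protected token.

```python
import re

# Hypothetical special tokens for this toy example.
SPECIAL_TOKENS = ["[CLS]", "[SEP]"]

# Stand-in for a real tokenization algorithm: words and single punctuation marks.
_WORD_RE = re.compile(r"\w+|[^\w\s]")


def toy_tokenize(text, split_special_tokens=False):
    """Tokenize `text`, optionally treating special tokens as plain text."""
    if split_special_tokens:
        # Special tokens get no protection: "[SEP]" becomes "[", "SEP", "]".
        return _WORD_RE.findall(text)

    # Otherwise, cut the text around special tokens and keep each one whole.
    pattern = "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
    tokens = []
    for chunk in re.split(pattern, text):
        if chunk in SPECIAL_TOKENS:
            tokens.append(chunk)
        else:
            tokens.extend(_WORD_RE.findall(chunk))
    return tokens


user_input = "hello [SEP] world"
print(toy_tokenize(user_input))
print(toy_tokenize(user_input, split_special_tokens=True))
```

With splitting enabled, text typed by a user can no longer produce a special token on its own, which is the production scenario described above.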

@ArthurZucker ArthurZucker changed the title [split_special_tokens] Add support for split_special_tokens argument to encode WIP [split_special_tokens] Add support for split_special_tokens argument to encode Jul 25, 2023
@HuggingFaceDocBuilderDev
HuggingFaceDocBuilderDev commented Jul 26, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker changed the title WIP [split_special_tokens] Add support for split_special_tokens argument to encode [split_special_tokens] Add support for split_special_tokens argument to encode Aug 17, 2023
@ArthurZucker ArthurZucker requested a review from sgugger August 17, 2023 15:25
Collaborator

@sgugger sgugger left a comment

Thanks again!

src/transformers/tokenization_utils_base.py (outdated, resolved)
@ArthurZucker ArthurZucker merged commit 30b3c46 into huggingface:main Aug 18, 2023
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
…ent to encode (huggingface#25081)

* draft changes

* update and add tests

* styling for no

* move test

* path to usable model

* update test

* small update

* update bertbased tokenizers

* don'tuse kwargs for _tokenize

* don'tuse kwargs for _tokenize

* fix copies

* update

* update test for special tokenizers

* fixup

* skip two tests

* remove pdb breakpiont()

* wowo

* rewrite custom tests

* nits

* revert chang in target keys

* fix markup lm

* update documentation of the argument

3 participants