AddedToken's arguments are ignored when passed to the add_tokens method of slow tokenizers #20734

Closed
2 of 4 tasks
SaulLu opened this issue Dec 12, 2022 · 3 comments · Fixed by #23909


SaulLu commented Dec 12, 2022

System Info

  • transformers version: 4.25.1
  • Platform: Linux-5.10.133+-x86_64-with-glibc2.27
  • Python version: 3.8.16
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.0+cu116 (False)
  • Tensorflow version (GPU?): 2.9.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

An explanation of the bug and the steps to reproduce it are in the following Google Colab notebook: https://colab.research.google.com/drive/19SS6Tzlgo0vntFtM6ZsCYq8BNZ5Dy1cS?usp=sharing
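
As a quick illustration of the mismatch (a minimal sketch only; the checkpoint name and the added token are placeholders, and the full reproduction is in the notebook above):

from transformers import AddedToken, AutoTokenizer

# Placeholder checkpoint, chosen only for illustration.
slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

# An added token carrying extra options.
token = AddedToken("<new_tok>", lstrip=True, single_word=True)
slow.add_tokens([token])
fast.add_tokens([token])

text = "some text with <new_tok>inside"
# On affected versions the two outputs can diverge, because the slow tokenizer
# silently discards the lstrip/single_word options while the fast (Rust-backed)
# tokenizer honours them.
print(slow.tokenize(text))
print(fast.tokenize(text))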

Expected behavior

I would expect the fast and slow tokenizers to treat AddedToken's arguments in the same way.

I think the loss of information for the slow tokenizer occurs at this line:

new_tokens = [str(tok) for tok in new_tokens]
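
For illustration (not part of the original report): casting an AddedToken to str keeps only its content, so any options set on the token are gone by the time the slow tokenizer stores it.

from transformers import AddedToken

token = AddedToken("<new_tok>", lstrip=True, rstrip=True, single_word=True)
# str() returns only the token content; the lstrip/rstrip/single_word flags
# set above are not carried along, which is the information loss described here.
print(str(token))  # prints: <new_tok>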

@ArthurZucker ArthurZucker self-assigned this Dec 16, 2022
@ArthurZucker ArthurZucker reopened this Feb 15, 2023
@amyeroberts amyeroberts reopened this May 11, 2023
@ArthurZucker ArthurZucker added the Core: Tokenization Internals of the library; Tokenization. label May 25, 2023
@ArthurZucker
Collaborator

I have not dropped this, it's still on my TODO list. There are a lot of linked issues!

@ArthurZucker
Collaborator

(The update is going to take longer, as I am refactoring the tokenizers.)

@superRookie007

I had the same issue. It seems that AddedToken is implemented here if the tokenizers package is not available. However, the Python implementation using dataclasses does not behave the same way as the Rust implementation in tokenizers.
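
For context, a rough sketch of such a dataclass-based fallback (field names mirror the Rust-backed tokenizers.AddedToken; this is a paraphrase, not a copy of the linked code). The Python class only stores the flags; the matching behaviour those flags are supposed to control lives in the Rust backend, which is why the two implementations can diverge.

from dataclasses import dataclass, field

@dataclass(frozen=True, eq=True)
class AddedToken:
    # Same fields as the Rust-backed AddedToken, but with no matching logic attached.
    content: str = field(default_factory=str)
    single_word: bool = False
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = True

    def __str__(self) -> str:
        # Only the content survives a str() cast.
        return self.content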
