add a template to add missing tokenization test #16553

SaulLu · 2022-04-01T15:20:18Z

What does this PR do?

This PR adds a cookiecutter template for tokenization tests.

This is a first version. It could be offered in a good first issue so that users can add the missing tests for the tokenizers of the following models:

Flaubert (see "Add test suite for flaubert tokenizer", #15137)
LED
RemBert
Splinter

and eventually for these tokenizers too (currently they just inherit from BertTokenizer and only re-define its attributes):

MobileBert
ConvBert
Electra
Longformer
RetriBert
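
To illustrate what "inherit and re-define the attributes" can look like, and how a test suite might cover such a subclass, here is a minimal sketch. It does not use transformers: `BaseToyTokenizer`, `DerivedToyTokenizer` and both test classes are hypothetical stand-ins, with `BaseToyTokenizer` playing the role of the parent tokenizer.

```python
import unittest


class BaseToyTokenizer:
    """Hypothetical parent tokenizer (stands in for the inherited class)."""

    vocab_files_names = {"vocab_file": "vocab.txt"}
    max_model_input_sizes = {"base-toy": 512}

    def tokenize(self, text):
        # Lowercase, then split on whitespace.
        return text.lower().split()


class DerivedToyTokenizer(BaseToyTokenizer):
    """Only re-defines a class attribute; all behaviour is inherited."""

    max_model_input_sizes = {"derived-toy": 128}


class BaseToyTokenizationTest(unittest.TestCase):
    tokenizer_class = BaseToyTokenizer

    def test_tokenize_lowercases(self):
        tokenizer = self.tokenizer_class()
        self.assertListEqual(tokenizer.tokenize("Hello World"), ["hello", "world"])


class DerivedToyTokenizationTest(BaseToyTokenizationTest):
    # Inherits every test method, but runs it against the subclass.
    tokenizer_class = DerivedToyTokenizer

    def test_redefined_attributes(self):
        self.assertEqual(
            self.tokenizer_class.max_model_input_sizes, {"derived-toy": 128}
        )
```

With this layout, a dedicated test file for the derived tokenizer mostly checks the re-defined attributes, while the inherited behaviour is exercised by the parent's test methods.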

In the good first issue ticket, I plan to give a little more information on how to add the test than the readme currently provides. But don't hesitate to tell me if you think it is better to put as much as possible in the readme itself.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. I would love to have your input on this @LysandreJik, @sgugger, @patrickvonplaten or @patil-suraj 🤗

SaulLu · 2022-04-01T15:27:43Z

...emplate-{{cookiecutter.modelname}}/test_tokenization_{{cookiecutter.lowercase_modelname}}.py

+            "`self.tmpdirname`."
+        )
+
+    # TODO: add tests with hard-coded target values 
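
As a rough illustration of what a test "with hard-coded target values" could look like, here is a minimal sketch. It is not taken from the template: `ToyWhitespaceTokenizer` and its three-entry vocabulary are hypothetical, chosen only so that the example runs without transformers.

```python
import unittest


class ToyWhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer; not part of transformers."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.unk_token = "<unk>"

    def tokenize(self, text):
        # Lowercase, then split on whitespace.
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Unknown tokens map to the id of the unknown token.
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(token, unk_id) for token in tokens]


class ToyTokenizationTest(unittest.TestCase):
    def setUp(self):
        vocab = {"<unk>": 0, "hello": 1, "world": 2}
        self.tokenizer = ToyWhitespaceTokenizer(vocab)

    def test_full_tokenizer(self):
        # The expected tokens and ids are written out by hand, so a change in
        # tokenizer behaviour makes the test fail visibly.
        tokens = self.tokenizer.tokenize("Hello brave world")
        self.assertListEqual(tokens, ["hello", "brave", "world"])
        self.assertListEqual(
            self.tokenizer.convert_tokens_to_ids(tokens), [1, 0, 2]
        )
```

The file can be run with `python -m unittest`; for a real tokenizer the hand-written targets would come from its actual vocabulary.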


The template currently does not implement any additional tokenizer-specific tests. Please let me know if you think there are generic tests, not already in test_tokenization_common.py, that should be implemented for each tokenizer.

I'm pointing this out in particular because I've noticed that some tests are shared by multiple tokenizers. This is the case, for example, for test_convert_token_and_id, which is implemented in 18 different test files (out of 63).

transformers/tests/camembert/test_tokenization_camembert.py

Lines 49 to 55 in 4975002

    def test_convert_token_and_id(self):
        """Test ``_convert_token_to_id`` and ``_convert_id_to_token``."""
        token = "<pad>"
        token_id = 1

        self.assertEqual(self.get_tokenizer()._convert_token_to_id(token), token_id)
        self.assertEqual(self.get_tokenizer()._convert_id_to_token(token_id), token)
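
One way such a duplicated test could be shared is a mixin whose expected token/id pair is set per tokenizer. The sketch below is only an assumption about how that might be organised, not how test_tokenization_common.py works: `ConvertTokenAndIdMixin`, `known_token`, `known_token_id` and `ToyTokenizer` are hypothetical names.

```python
import unittest


class ToyTokenizer:
    """Hypothetical tokenizer exposing the two private conversion methods."""

    def __init__(self):
        self._token_to_id = {"<s>": 0, "<pad>": 1, "</s>": 2}
        self._id_to_token = {i: t for t, i in self._token_to_id.items()}

    def _convert_token_to_id(self, token):
        return self._token_to_id[token]

    def _convert_id_to_token(self, index):
        return self._id_to_token[index]


class ConvertTokenAndIdMixin:
    """Hypothetical shared test; each subclass sets the expected pair."""

    known_token = None
    known_token_id = None

    def get_tokenizer(self):
        raise NotImplementedError

    def test_convert_token_and_id(self):
        tokenizer = self.get_tokenizer()
        self.assertEqual(
            tokenizer._convert_token_to_id(self.known_token), self.known_token_id
        )
        self.assertEqual(
            tokenizer._convert_id_to_token(self.known_token_id), self.known_token
        )


class ToyTokenizationTest(ConvertTokenAndIdMixin, unittest.TestCase):
    # Only the expected values differ from one tokenizer to the next.
    known_token = "<pad>"
    known_token_id = 1

    def get_tokenizer(self):
        return ToyTokenizer()
```

Because the mixin does not inherit from `unittest.TestCase`, the shared test is only collected through concrete test classes that define the expected pair.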

sgugger

Thanks for adding this!

templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <[email protected]>

LysandreJik

Thank you @SaulLu!

* add a template to add missing tokenization test
* add cookiecutter setting
* improve doc
* Update templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <[email protected]>

add a template to add missing tokenization test

1e4dd9f

add cookiecutter setting

bd1aaad

improve doc

d583dbe

SaulLu changed the title from "[WIP] add a template to add missing tokenization test" to "add a template to add missing tokenization test" on Apr 1, 2022

SaulLu requested review from LysandreJik, patil-suraj, patrickvonplaten and sgugger April 1, 2022 16:22

sgugger approved these changes Apr 4, 2022

Update templates/adding_a_missing_tokenization_test/README.md

e3441fd

Co-authored-by: Sylvain Gugger <[email protected]>

LysandreJik approved these changes Apr 4, 2022

Merge remote-tracking branch 'upstream/main' into LS/add_missing_tests

9dd6cce

SaulLu merged commit 02214cb into huggingface:main Apr 5, 2022
