
add test checking the offsets for an input split into words for different add_prefix_space and trim_offsets args #1

Closed
wants to merge 142 commits

Conversation

@SaulLu (Owner) commented Dec 20, 2021

What does this PR do?

This PR shows a test that will not pass until we have a version of the tokenizer library that includes this change.

cc @LysandreJik for visibility
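The behavior under test can be illustrated with a small self-contained sketch. This is a toy model, not the `tokenizers` implementation, and `char_offsets` is a hypothetical helper: each pre-split word becomes one token, `add_prefix_space` virtually prepends a space before every word, and `trim_offsets` shrinks each reported character span so it excludes that leading whitespace.

```python
def char_offsets(words, add_prefix_space=False, trim_offsets=True):
    """Toy model of byte-level offset computation for pre-split words.

    Hypothetical helper, not the `tokenizers` API: each word maps to one
    token; with add_prefix_space a space is virtually prepended to every
    word, and trim_offsets removes that space from the reported span.
    """
    offsets = []
    pos = 0
    for i, word in enumerate(words):
        # Words after the first are always space-joined; the first word
        # only gets a space when add_prefix_space is set.
        prefixed = (" " + word) if (add_prefix_space or i > 0) else word
        start, end = pos, pos + len(prefixed)
        if trim_offsets:
            # Shift the start past any leading whitespace.
            start += len(prefixed) - len(prefixed.lstrip(" "))
        offsets.append((start, end))
        pos = end
    return offsets


print(char_offsets(["hello", "world"], add_prefix_space=True, trim_offsets=True))
# The trimmed spans cover only the word characters, not the virtual spaces.
```

The four combinations of the two flags are exactly what the added test enumerates: only the start of each span moves, depending on whether the virtual prefix space is counted as part of the token.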

codesue and others added 30 commits December 13, 2021 08:31
* Fix doc examples: cannot import name

* remove copy because of some necessary minor changes (maybe add copy to the individual methods instead)

* Keep copy with some modifications

Co-authored-by: ydshieh <[email protected]>
* Wip on metadata update

* Most of the script

* Add a job to auto-update the transformers metadata

* Style
* Mention no images added to repository

* Update CONTRIBUTING.md

Co-authored-by: NielsRogge <[email protected]>

Co-authored-by: NielsRogge <[email protected]>
* avoid tf.tile in embeddings

* remove more tf.tile in embeddings

* clean

Co-authored-by: ydshieh <[email protected]>
* First draft

* Improve docstring + clean up tests

* Remove unused code

* Add check in case one doesn't provide a preprocessor
* Convert Trainer doc page to MarkDown

* Fix repo consistency

* Fix the doc build test job
* Adding some slow test to check for perceiver at least from a high level.

* Re-enabling fast tests for Perceiver ImageClassification.

* Perceiver might try to run some text-only pipelines without a Tokenizer (a Fast tokenizer doesn't exist) and with only a FeatureExtractor.

* Oops.

* Adding a comment for `update_config_with_model_class`.

* Remove `model_architecture` to get `tiny_config`.

* Finalize rebase.

* Smarter way to handle undefined FastTokenizer.

* Remove old code.

* Addressing some nits.

* Don't instantiate `None`.
…face#13410)

* use jax and jnp instead of numpy in data_loader

* return batches as np.ndarray
* Adding support for multiple mask tokens.

- Original implem: huggingface#10222

Co-authored-by: njafer <[email protected]>

* In order to accommodate optionally multimodal models like Perceiver

we add information to the tasks to specify tasks where we know for sure
if we need the tokenizer/feature_extractor or not.

* Adding info in the documentation about multi masks.

+ marked as experimental.

* Add a copy() to prevent overriding the same tensor over and over.

* Fixup.

* Adding small test for multi mask with real values.

Co-authored-by: njafer <[email protected]>
…ingface#14722)

* Fix broken links to distillation on index page of documentation

* Fix broken link for distillation in main README

* Run make fixup
* Fake new model

* Fix doc-building test job

* Is this the problem?

* Another try

* Typo

* Clean up

* Can we do without -e ?

* Clean setup
* Initial commit for Keras model cards

* Revert accidental change

* make style

* make style

* make style

* Fix PR comments

* Move repo creation to __init__

* Fixes to README.md creation

* Partial progress for proper card creation on `push_to_hub`

* Proper card creation from `push_to_hub` plus fixes for malformed model cards

* Fixes for model card creation outside the callback

* Adding a model card creation test

* Putting the model card creation test in the right file.
Good job, Matt.

* make style

* Fix model card test temp dir usage

* Fix model card creation when no optimizer present

* Fixes for when training history not present

* Fix accidental edit to test_modeling_common
* Fix code examples

* Fix code example
* Fix docs

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Code quality

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>
* PoC for conserving old links

* Do the same for other links

* remap the redirects section

* add instructions on how to move sections

* improve

Co-authored-by: Stas Bekman <[email protected]>
patrickvonplaten and others added 18 commits December 28, 2021 13:41
* speed up canine and mluke

* speed up mbart and mbart50 toks

* upload files
…ngface#14959)

* rename classes

* clean up more namings

* remove bogus file

* Apply suggestions from code review

* Apply suggestions from code review

* replace more names

* more regex replace

* make style

* correct

* correct more

* make style

* finish

* correct more in wav2vec2

* make style

* improve freeze_extractor

* add aliases

* add tf aliases
The absl workaround hasn't been needed since April 2019 (abseil/abseil-py#99), so it should be safe to remove it.
* Fixing a pathological case for slow tokenizers

* Update src/transformers/tokenization_utils.py
huggingface#14881)

* [AutoProcessor] Correct AutoProcessor and automatically add processor class

* up

* up

* up

* up

* up

* up

* up

* up

* continue tomorrow

* up

* up

* up

* make processor class private

* fix loop
…uggingface#14980)

* [Generate] correct encoder_outputs are passed without attention_mask

* Apply suggestions from code review

* up
…ingface#14988)

* Adding `num_return_sequences` support for text2text generation.

Co-Authored-By: Enze <[email protected]>

* Update tests/test_pipelines_text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update tests/test_pipelines_text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Enze <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
* Enabling `tokenizers` upgrade.

* Moved ugly comment.

* Tokenizers==0.11.1 needs an update to keep borrow checker

happy in highly contiguous calls.

* Support both 0.11.1 and 0.11.0
…uggingface#14994)

* Allow training to resume even if RNG states are not properly loaded

* Proper f-string
* Map model_type and doc pages names

* Add script

* Fix typo

* Quality

* Manual check for Auto

Co-authored-by: Lysandre <[email protected]>
* Enabling `truncation_side` for Slow and Fast tokenizer.

Co-Authored-by: Niels Rogge <[email protected]>

* Disable failing tests.

* Layout xlm.

* assert -> assertEqual.

Co-authored-by: Niels Rogge <[email protected]>
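The `truncation_side` commits above make the tokenizer configurable about which end of an over-long sequence gets dropped. A minimal sketch of the idea, with a hypothetical `truncate` helper rather than the library's own code:

```python
def truncate(ids, max_length, truncation_side="right"):
    """Drop tokens from the chosen side until the sequence fits.

    Toy illustration of the truncation_side option: "right" keeps the
    beginning of the sequence, "left" keeps the end.
    """
    if len(ids) <= max_length:
        return ids  # nothing to do
    if truncation_side == "right":
        return ids[:max_length]
    return ids[-max_length:]
```

Left truncation matters for tasks like dialogue or streaming transcripts, where the most recent tokens are usually the informative ones.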
* Naive ASR chunking

* Fixing batching for ASR.

Co-authored-by: Nicolas Patry <[email protected]>
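"Naive ASR chunking" splits audio that is too long for the model into fixed-size, overlapping windows. A self-contained sketch of the window arithmetic (a conceptual toy, not the pipeline implementation; `chunk_indices` is a hypothetical helper):

```python
def chunk_indices(n_samples, chunk_len, stride):
    """Yield (start, end) sample windows covering the whole input.

    Consecutive windows overlap by `stride` samples, so a word cut at a
    chunk boundary appears in both chunks and can be reconciled when the
    per-chunk transcriptions are stitched back together.
    """
    step = chunk_len - stride
    start = 0
    while start < n_samples:
        yield (start, min(start + chunk_len, n_samples))
        if start + chunk_len >= n_samples:
            break
        start += step
```

Batching the resulting windows through the model is then straightforward, which is what the "Fixing batching for ASR" commit addresses.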
* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

Narsil commented Jan 4, 2022

@SaulLu should be good now, no?

@SaulLu SaulLu requested review from sgugger and removed request for sgugger January 4, 2022 08:56
@LysandreJik

I think there was a small issue with the merge commit 😄
[screenshot]


SaulLu commented Jan 5, 2022

Indeed 😄 , I will close this PR.

In any case, I had to open a new one between one of my fork branches and the Hugging Face repo (the new PR is here).

@SaulLu SaulLu closed this Jan 5, 2022
SaulLu pushed a commit that referenced this pull request Feb 15, 2022
…5416)

* added classes to get started with constrained beam search

* in progress, think i can directly force tokens now but not yet with the round robin

* think now i have total control, now need to code the bank selection

* technically works as desired, need to optimize and fix design choices leading to undesirable outputs

* complete PR #1 without disjunctive decoding

* removed incorrect tests

* Delete k.txt

* Delete test.py

* Delete test.sh

* revert changes to test scripts

* genutils

* full implementation with testing, no disjunctive yet

* shifted docs

* passing all tests realistically ran locally

* removing accidentally included print statements

* fixed source of error in initial PR test

* fixing the get_device() vs device trap

* fixed documentation docstrings about constrained_beam_search

* fixed failing tests for Speech2TextModel's floating point inputs

* fix cuda long tensor

* added examples and testing for them and found & fixed a bug in beam_search and constrained_beam_search

* deleted accidentally added test halting code with assert False

* code reformat

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

* fixing based on comments on PR

* took out the testing code that should work but fails without the beam search modification; style changes

* fixing comments issues

* docstrings for ConstraintListState

* typo in PhrasalConstraint docstring

* docstrings improvements

Co-authored-by: Patrick von Platen <[email protected]>
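The constrained beam search commits above force a given phrase to appear in the generated output. The core bookkeeping can be sketched as a small state machine; this is a toy illustration, and `PhrasalConstraintState` is a hypothetical class, not the transformers `PhrasalConstraint` API:

```python
class PhrasalConstraintState:
    """Tracks progress through a token phrase that must appear in the output."""

    def __init__(self, token_ids):
        self.token_ids = token_ids
        self.progress = 0  # number of phrase tokens matched so far

    @property
    def completed(self):
        return self.progress == len(self.token_ids)

    def advance(self):
        """Next token the beam should be forced to generate, or None if done."""
        return None if self.completed else self.token_ids[self.progress]

    def update(self, token_id):
        """Consume a generated token; a mismatch resets the partial match."""
        if self.completed:
            return
        if token_id == self.token_ids[self.progress]:
            self.progress += 1
        else:
            self.progress = 0
```

During decoding, beams at different progress levels are kept in separate "banks" (the bank selection mentioned in the commits), so fluent unconstrained candidates compete with candidates that are further along in satisfying the constraint.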
SaulLu pushed a commit that referenced this pull request Mar 30, 2022
)

* added classes to get started with constrained beam search

* in progress, think i can directly force tokens now but not yet with the round robin

* think now i have total control, now need to code the bank selection

* technically works as desired, need to optimize and fix design choices leading to undesirable outputs

* complete PR #1 without disjunctive decoding

* removed incorrect tests

* Delete k.txt

* Delete test.py

* Delete test.sh

* revert changes to test scripts

* genutils

* full implementation with testing, no disjunctive yet

* shifted docs

* passing all tests realistically ran locally

* removing accidentally included print statements

* fixed source of error in initial PR test

* fixing the get_device() vs device trap

* fixed documentation docstrings about constrained_beam_search

* fixed failing tests for Speech2TextModel's floating point inputs

* fix cuda long tensor

* added examples and testing for them and found & fixed a bug in beam_search and constrained_beam_search

* deleted accidentally added test halting code with assert False

* code reformat

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

* fixing based on comments on PR

* took out the testing code that should work but fails without the beam search modification; style changes

* fixing comments issues

* docstrings for ConstraintListState

* typo in PhrasalConstraint docstring

* docstrings improvements

* finished adding what is sort of an opinionated implementation of disjunctive generation, but it revealed errors in inner beam search logic during testing.

* fixed bug found in constrained beam search that used beam_idx that were not global across all the batches

* disjunctive constraint working 100% correctly

* passing all tests

* Accidentally included mlruns

* Update src/transformers/generation_beam_constraints.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update src/transformers/generation_beam_constraints.py

Co-authored-by: Patrick von Platen <[email protected]>

* complete overhaul of type complexities and other nits

* strict type checks in generate()

* fixing second round of feedback by narsil

* fixed failing generation test because of type check overhaul

* generation test fail fix

* fixing test fails

Co-authored-by: Patrick von Platen <[email protected]>
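The second round of commits adds disjunctive decoding: instead of one required phrase, the output must contain at least one phrase from a set of alternatives. The satisfaction check can be sketched as a plain subsequence search (a toy, with a hypothetical `satisfies_disjunctive` helper, not the transformers `DisjunctiveConstraint` API):

```python
def satisfies_disjunctive(output_ids, phrase_options):
    """True if the output contains at least one of the candidate phrases.

    output_ids: generated token ids.
    phrase_options: list of token-id phrases; any single match suffices.
    """
    def contains(seq, phrase):
        n, m = len(seq), len(phrase)
        # Slide a window of the phrase's length over the sequence.
        return any(seq[i:i + m] == phrase for i in range(n - m + 1))

    return any(contains(output_ids, p) for p in phrase_options)
```

The tricky part fixed in these commits is not this check but the beam bookkeeping around it, e.g. the bug where `beam_idx` was not global across batches.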
SaulLu pushed a commit that referenced this pull request May 31, 2022
Improve get_added_vocabulary_hacking
SaulLu pushed a commit that referenced this pull request Jul 18, 2022
* chore: initial commit

Copied the torch implementation of regnets and porting the code to tf step by step. Also introduced an output layer which was needed for regnets.

* chore: porting the rest of the modules to tensorflow

did not change the documentation yet, yet to try the playground on the model

* Fix initilizations (#1)

* fix: code structure in few cases.

* fix: code structure to align tf models.

* fix: layer naming, bn layer still remains.

* chore: change default epsilon and momentum in bn.

* chore: styling nits.

* fix: cross-loading bn params.

* fix: regnet tf model, integration passing.

* add: tests for TF regnet.

* fix: code quality related issues.

* chore: added rest of the files.

* minor additions..

* fix: repo consistency.

* fix: regnet tf tests.

* chore: reorganize dummy_tf_objects for regnet.

* chore: remove checkpoint var.

* chore: remove unnecessary files.

* chore: run make style.

* Update docs/source/en/model_doc/regnet.mdx

Co-authored-by: Sylvain Gugger <[email protected]>

* chore: PR feedback I.

* fix: pt test. thanks to @ydshieh.

* New adaptive pooler (huggingface#3)

* feat: new adaptive pooler

Co-authored-by: @Rocketknight1

* chore: remove image_size argument.

Co-authored-by: matt <[email protected]>

Co-authored-by: matt <[email protected]>

* Empty-Commit

* chore: remove image_size comment.

* chore: remove playground_tf.py

* chore: minor changes related to spacing.

* chore: make style.

* Update src/transformers/models/regnet/modeling_tf_regnet.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/regnet/modeling_tf_regnet.py

Co-authored-by: amyeroberts <[email protected]>

* chore: refactored __init__.

* chore: copied from -> taken from./g

* adaptive pool -> global avg pool, channel check.

* chore: move channel check to stem.

* pr comments - minor refactor and add regnets to doc tests.

* Update src/transformers/models/regnet/modeling_tf_regnet.py

Co-authored-by: NielsRogge <[email protected]>

* minor fix in the xlayer.

* Empty-Commit

* chore: removed from_pt=True.

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: matt <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
SaulLu pushed a commit that referenced this pull request Sep 9, 2022
proposal of a fix for the MarkupLM fast tokenizer