
add test checking the offsets for an input split into words for different add_prefix_space and trim_offsets args #1

Closed
wants to merge 142 commits

Conversation

@SaulLu (Owner) commented Dec 20, 2021

What does this PR do?

This PR shows a test that will not pass until we have a version of the tokenizer library that includes this change.

cc @LysandreJik for visibility
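The behavior under test can be illustrated with a small self-contained sketch. This is a toy model, not the `tokenizers` implementation, and `char_offsets` is a hypothetical helper: each pre-split word becomes one token, `add_prefix_space` virtually prepends a space before every word, and `trim_offsets` shrinks each reported character span so it excludes that leading whitespace.

```python
def char_offsets(words, add_prefix_space=False, trim_offsets=True):
    """Toy model of byte-level offset computation for pre-split words.

    Hypothetical helper, not the `tokenizers` API: each word maps to one
    token; with add_prefix_space a space is virtually prepended to every
    word, and trim_offsets removes that space from the reported span.
    """
    offsets = []
    pos = 0
    for i, word in enumerate(words):
        # Words after the first are always space-joined; the first word
        # only gets a space when add_prefix_space is set.
        prefixed = (" " + word) if (add_prefix_space or i > 0) else word
        start, end = pos, pos + len(prefixed)
        if trim_offsets:
            # Shift the start past any leading whitespace.
            start += len(prefixed) - len(prefixed.lstrip(" "))
        offsets.append((start, end))
        pos = end
    return offsets


print(char_offsets(["hello", "world"], add_prefix_space=True, trim_offsets=True))
# The trimmed spans cover only the word characters, not the virtual spaces.
```

The four combinations of the two flags are exactly what the added test enumerates: only the start of each span moves, depending on whether the virtual prefix space is counted as part of the token.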

codesue and others added 30 commits December 13, 2021 08:31
* Fix doc examples: cannot import name

* remove copy because of some necessary minor changes (maybe add copy to the individual methods instead)

* Keep copy with some modifications

Co-authored-by: ydshieh <[email protected]>
* Wip on metadata update

* Most of the script

* Add a job to auto-update the transformers metadata

* Style
* Mention no images added to repository

* Update CONTRIBUTING.md

Co-authored-by: NielsRogge <[email protected]>

Co-authored-by: NielsRogge <[email protected]>
* avoid tf.tile in embeddings

* remove more tf.tile in embeddings

* clean

Co-authored-by: ydshieh <[email protected]>
* First draft

* Improve docstring + clean up tests

* Remove unused code

* Add check in case one doesn't provide a preprocessor
* Convert Trainer doc page to MarkDown

* Fix repo consistency

* Fix the doc build test job
* Adding some slow test to check for perceiver at least from a high level.

* Re-enabling fast tests for Perceiver ImageClassification.

* Perceiver might try to run some text-only pipelines without a Tokenizer (a Fast tokenizer doesn't exist) and with only a FeatureExtractor.

* Oops.

* Adding a comment for `update_config_with_model_class`.

* Remove `model_architecture` to get `tiny_config`.

* Finalize rebase.

* Smarter way to handle undefined FastTokenizer.

* Remove old code.

* Addressing some nits.

* Don't instantiate `None`.
…face#13410)

* use jax and jnp instead of numpy in data_loader

* return batches as np.ndarray
* Adding support for multiple mask tokens.

- Original implem: huggingface#10222

Co-authored-by: njafer <[email protected]>

* In order to accommodate optionally multimodal models like Perceiver

we add information to the tasks to specify tasks where we know for sure
if we need the tokenizer/feature_extractor or not.

* Adding info in the documentation about multi masks.

+ marked as experimental.

* Add a copy() to prevent overriding the same tensor over and over.

* Fixup.

* Adding small test for multi mask with real values.

Co-authored-by: njafer <[email protected]>
…ingface#14722)

* Fix broken links to distillation on index page of documentation

* Fix broken link for distillation in main README

* Run make fixup
* Fake new model

* Fix doc-building test job

* Is this the problem?

* Another try

* Typo

* Clean up

* Can we do without -e ?

* Clean setup
* Initial commit for Keras model cards

* Revert accidental change

* make style

* make style

* make style

* Fix PR comments

* Move repo creation to __init__

* Fixes to README.md creation

* Partial progress for proper card creation on `push_to_hub`

* Proper card creation from `push_to_hub` plus fixes for malformed model cards

* Fixes for model card creation outside the callback

* Adding a model card creation test

* Putting the model card creation test in the right file.
Good job, Matt.

* make style

* Fix model card test temp dir usage

* Fix model card creation when no optimizer present

* Fixes for when training history not present

* Fix accidental edit to test_modeling_common
* Fix code examples

* Fix code example
* Fix docs

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* Code quality

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: Lysandre <[email protected]>
* PoC for conserving old links

* Do the same for other links

* remap the redirects section

* add instructions on how to move sections

* improve

Co-authored-by: Stas Bekman <[email protected]>
patrickvonplaten and others added 18 commits December 28, 2021 13:41
* speed up canine and mluke

* speed up mbart and mbart50 toks

* upload files
…ngface#14959)

* rename classes

* clean up more namings

* remove bogus file

* Apply suggestions from code review

* Apply suggestions from code review

* replace more names

* more regex replace

* make style

* correct

* correct more

* make style

* finish

* correct more in wav2vec2

* make style

* improve freeze_extractor

* add aliases

* add tf aliases
The absl workaround hasn't been needed since April 2019 (abseil/abseil-py#99), so it should be safe to remove it.
* Fixing a pathological case for slow tokenizers

* Update src/transformers/tokenization_utils.py
huggingface#14881)

* [AutoProcessor] Correct AutoProcessor and automatically add processor class

* up

* up

* up

* up

* up

* up

* up

* up

* continue tomorrow

* up

* up

* up

* make processor class private

* fix loop
…uggingface#14980)

* [Generate] correct encoder_outputs are passed without attention_mask

* Apply suggestions from code review

* up
…ingface#14988)

* Adding `num_return_sequences` support for text2text generation.

Co-Authored-By: Enze <[email protected]>

* Update tests/test_pipelines_text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update tests/test_pipelines_text2text_generation.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Enze <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
* Enabling `tokenizers` upgrade.

* Moved ugly comment.

* Tokenizers==0.11.1 needs an update to keep borrow checker

happy in highly contiguous calls.

* Support both 0.11.1 and 0.11.0
…uggingface#14994)

* Allow training to resume even if RNG states are not properly loaded

* Proper f-string
* Map model_type and doc pages names

* Add script

* Fix typo

* Quality

* Manual check for Auto

Co-authored-by: Lysandre <[email protected]>
* Enabling `truncation_side` for Slow and Fast tokenizer.

Co-Authored-by: Niels Rogge <[email protected]>

* Disable failing tests.

* Layout xlm.

* assert -> assertEqual.

Co-authored-by: Niels Rogge <[email protected]>
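The `truncation_side` commits above make the tokenizer configurable about which end of an over-long sequence gets dropped. A minimal sketch of the idea, with a hypothetical `truncate` helper rather than the library's own code:

```python
def truncate(ids, max_length, truncation_side="right"):
    """Drop tokens from the chosen side until the sequence fits.

    Toy illustration of the truncation_side option: "right" keeps the
    beginning of the sequence, "left" keeps the end.
    """
    if len(ids) <= max_length:
        return ids  # nothing to do
    if truncation_side == "right":
        return ids[:max_length]
    return ids[-max_length:]
```

Left truncation matters for tasks like dialogue or streaming transcripts, where the most recent tokens are usually the informative ones.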
* Naive ASR chunking

* Fixing batching for ASR.

Co-authored-by: Nicolas Patry <[email protected]>
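"Naive ASR chunking" splits audio that is too long for the model into fixed-size, overlapping windows. A self-contained sketch of the window arithmetic (a conceptual toy, not the pipeline implementation; `chunk_indices` is a hypothetical helper):

```python
def chunk_indices(n_samples, chunk_len, stride):
    """Yield (start, end) sample windows covering the whole input.

    Consecutive windows overlap by `stride` samples, so a word cut at a
    chunk boundary appears in both chunks and can be reconciled when the
    per-chunk transcriptions are stitched back together.
    """
    step = chunk_len - stride
    start = 0
    while start < n_samples:
        yield (start, min(start + chunk_len, n_samples))
        if start + chunk_len >= n_samples:
            break
        start += step
```

Batching the resulting windows through the model is then straightforward, which is what the "Fixing batching for ASR" commit addresses.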
* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

* Update parallelism.mdx

Narsil commented Jan 4, 2022

@SaulLu should be good now, no?

@SaulLu SaulLu requested review from sgugger and removed request for sgugger January 4, 2022 08:56
@LysandreJik

I think there was a small issue with the merge commit 😄
[screenshot]


SaulLu commented Jan 5, 2022

Indeed 😄 , I will close this PR.

In any case, I had to open a new one between one of my fork branches and the Hugging Face repo (the new PR is here).

@SaulLu SaulLu closed this Jan 5, 2022
SaulLu pushed a commit that referenced this pull request Feb 15, 2022
…5416)

* added classes to get started with constrained beam search

* in progress, think i can directly force tokens now but not yet with the round robin

* think now i have total control, now need to code the bank selection

* technically works as desired, need to optimize and fix design choices leading to undesirable outputs

* complete PR #1 without disjunctive decoding

* removed incorrect tests

* Delete k.txt

* Delete test.py

* Delete test.sh

* revert changes to test scripts

* genutils

* full implementation with testing, no disjunctive yet

* shifted docs

* passing all tests realistically ran locally

* removing accidentally included print statements

* fixed source of error in initial PR test

* fixing the get_device() vs device trap

* fixed documentation docstrings about constrained_beam_search

* fixed failing tests for Speech2TextModel's floating point inputs

* fix cuda long tensor

* added examples and testing for them and found & fixed a bug in beam_search and constrained_beam_search

* deleted accidentally added test halting code with assert False

* code reformat

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

* fixing based on comments on PR

* took out the testing code that should work but fails without the beam search modification; style changes

* fixing comments issues

* docstrings for ConstraintListState

* typo in PhrasalConstraint docstring

* docstrings improvements

Co-authored-by: Patrick von Platen <[email protected]>
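The constrained beam search commits above force a given phrase to appear in the generated output. The core bookkeeping can be sketched as a small state machine; this is a toy illustration, and `PhrasalConstraintState` is a hypothetical class, not the transformers `PhrasalConstraint` API:

```python
class PhrasalConstraintState:
    """Tracks progress through a token phrase that must appear in the output."""

    def __init__(self, token_ids):
        self.token_ids = token_ids
        self.progress = 0  # number of phrase tokens matched so far

    @property
    def completed(self):
        return self.progress == len(self.token_ids)

    def advance(self):
        """Next token the beam should be forced to generate, or None if done."""
        return None if self.completed else self.token_ids[self.progress]

    def update(self, token_id):
        """Consume a generated token; a mismatch resets the partial match."""
        if self.completed:
            return
        if token_id == self.token_ids[self.progress]:
            self.progress += 1
        else:
            self.progress = 0
```

During decoding, beams at different progress levels are kept in separate "banks" (the bank selection mentioned in the commits), so fluent unconstrained candidates compete with candidates that are further along in satisfying the constraint.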
SaulLu pushed a commit that referenced this pull request Mar 30, 2022
)

* added classes to get started with constrained beam search

* in progress, think i can directly force tokens now but not yet with the round robin

* think now i have total control, now need to code the bank selection

* technically works as desired, need to optimize and fix design choices leading to undesirable outputs

* complete PR #1 without disjunctive decoding

* removed incorrect tests

* Delete k.txt

* Delete test.py

* Delete test.sh

* revert changes to test scripts

* genutils

* full implementation with testing, no disjunctive yet

* shifted docs

* passing all tests realistically ran locally

* removing accidentally included print statements

* fixed source of error in initial PR test

* fixing the get_device() vs device trap

* fixed documentation docstrings about constrained_beam_search

* fixed failing tests for Speech2TextModel's floating point inputs

* fix cuda long tensor

* added examples and testing for them and found & fixed a bug in beam_search and constrained_beam_search

* deleted accidentally added test halting code with assert False

* code reformat

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update tests/test_generation_utils.py

* fixing based on comments on PR

* took out the testing code that should work but fails without the beam search modification; style changes

* fixing comments issues

* docstrings for ConstraintListState

* typo in PhrasalConstraint docstring

* docstrings improvements

* finished adding what is sort of an opinionated implementation of disjunctive generation, but it revealed errors in inner beam search logic during testing.

* fixed bug found in constrained beam search that used beam_idx that were not global across all the batches

* disjunctive constraint working 100% correctly

* passing all tests

* Accidentally included mlruns

* Update src/transformers/generation_beam_constraints.py

Co-authored-by: Patrick von Platen <[email protected]>

* Update src/transformers/generation_beam_constraints.py

Co-authored-by: Patrick von Platen <[email protected]>

* complete overhaul of type complexities and other nits

* strict type checks in generate()

* fixing second round of feedback by narsil

* fixed failing generation test because of type check overhaul

* generation test fail fix

* fixing test fails

Co-authored-by: Patrick von Platen <[email protected]>
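The second round of commits adds disjunctive decoding: instead of one required phrase, the output must contain at least one phrase from a set of alternatives. The satisfaction check can be sketched as a plain subsequence search (a toy, with a hypothetical `satisfies_disjunctive` helper, not the transformers `DisjunctiveConstraint` API):

```python
def satisfies_disjunctive(output_ids, phrase_options):
    """True if the output contains at least one of the candidate phrases.

    output_ids: generated token ids.
    phrase_options: list of token-id phrases; any single match suffices.
    """
    def contains(seq, phrase):
        n, m = len(seq), len(phrase)
        # Slide a window of the phrase's length over the sequence.
        return any(seq[i:i + m] == phrase for i in range(n - m + 1))

    return any(contains(output_ids, p) for p in phrase_options)
```

The tricky part fixed in these commits is not this check but the beam bookkeeping around it, e.g. the bug where `beam_idx` was not global across batches.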
SaulLu pushed a commit that referenced this pull request May 31, 2022
Improve get_added_vocabulary_hacking
SaulLu pushed a commit that referenced this pull request Jul 18, 2022
* chore: initial commit

Copied the torch implementation of regnets and porting the code to tf step by step. Also introduced an output layer which was needed for regnets.

* chore: porting the rest of the modules to tensorflow

did not change the documentation yet, yet to try the playground on the model

* Fix initilizations (#1)

* fix: code structure in few cases.

* fix: code structure to align tf models.

* fix: layer naming, bn layer still remains.

* chore: change default epsilon and momentum in bn.

* chore: styling nits.

* fix: cross-loading bn params.

* fix: regnet tf model, integration passing.

* add: tests for TF regnet.

* fix: code quality related issues.

* chore: added rest of the files.

* minor additions..

* fix: repo consistency.

* fix: regnet tf tests.

* chore: reorganize dummy_tf_objects for regnet.

* chore: remove checkpoint var.

* chore: remove unnecessary files.

* chore: run make style.

* Update docs/source/en/model_doc/regnet.mdx

Co-authored-by: Sylvain Gugger <[email protected]>

* chore: PR feedback I.

* fix: pt test. thanks to @ydshieh.

* New adaptive pooler (huggingface#3)

* feat: new adaptive pooler

Co-authored-by: @Rocketknight1

* chore: remove image_size argument.

Co-authored-by: matt <[email protected]>

Co-authored-by: matt <[email protected]>

* Empty-Commit

* chore: remove image_size comment.

* chore: remove playground_tf.py

* chore: minor changes related to spacing.

* chore: make style.

* Update src/transformers/models/regnet/modeling_tf_regnet.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/models/regnet/modeling_tf_regnet.py

Co-authored-by: amyeroberts <[email protected]>

* chore: refactored __init__.

* chore: copied from -> taken from./g

* adaptive pool -> global avg pool, channel check.

* chore: move channel check to stem.

* pr comments - minor refactor and add regnets to doc tests.

* Update src/transformers/models/regnet/modeling_tf_regnet.py

Co-authored-by: NielsRogge <[email protected]>

* minor fix in the xlayer.

* Empty-Commit

* chore: removed from_pt=True.

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: matt <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: NielsRogge <[email protected]>
SaulLu pushed a commit that referenced this pull request Sep 9, 2022
proposal of a fix for the MarkupLM fast tokenizer