Add `postprocessor::Sequence` #1005

mishig25 · 2022-05-27T14:51:56Z

Closes #873 by implementing postprocessor::Sequence

Let me re-share a design idea that, I think, we also discussed offline a long time ago: add a new method to all processors, for example named process_chain, with a composable signature (it takes a list), which would be the method called by processors.Sequence. 😊

Please let me know if the proposed changes look good; if so, I will add tests. Otherwise, I'm happy to make any changes as necessary. Also, please see the review comments, as I had some questions.

tokenizers/src/tokenizer/mod.rs

tokenizers/src/processors/template.rs

thomasw21 · 2022-05-27T16:09:04Z

Not sure I understand why we use process_chain... I think every fn process should take encodings: Vec<Encoding> instead, and Sequence would just be a for loop that calls process, rather than adding a new endpoint?
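The alternative suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Encoding` is a toy stand-in for the crate's type, the list-based `process` signature is the hypothetical one from the comment (not the crate's actual trait), and `AddPrefix` is an invented processor used to show the chaining order.

```rust
// Toy stand-in for the crate's `Encoding`; only the field needed for the sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Encoding {
    pub tokens: Vec<String>,
}

// Hypothetical trait shape: `process` takes and returns a list of encodings.
pub trait PostProcessor {
    fn process(
        &self,
        encodings: Vec<Encoding>,
        add_special_tokens: bool,
    ) -> Result<Vec<Encoding>, String>;
}

// Invented example processor: prepends a marker token to the first encoding.
pub struct AddPrefix(pub &'static str);

impl PostProcessor for AddPrefix {
    fn process(
        &self,
        mut encodings: Vec<Encoding>,
        add_special_tokens: bool,
    ) -> Result<Vec<Encoding>, String> {
        if add_special_tokens {
            if let Some(first) = encodings.first_mut() {
                first.tokens.insert(0, self.0.to_string());
            }
        }
        Ok(encodings)
    }
}

// `Sequence` is then just a loop that feeds each processor's output to the next.
pub struct Sequence {
    pub processors: Vec<Box<dyn PostProcessor>>,
}

impl PostProcessor for Sequence {
    fn process(
        &self,
        mut encodings: Vec<Encoding>,
        add_special_tokens: bool,
    ) -> Result<Vec<Encoding>, String> {
        for processor in &self.processors {
            encodings = processor.process(encodings, add_special_tokens)?;
        }
        Ok(encodings)
    }
}
```

With this shape, a `Sequence` of `AddPrefix("<a>")` followed by `AddPrefix("<b>")` applies the processors in list order, so the second marker ends up in front of the first.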

mishig25 · 2022-05-30T08:47:16Z

tokenizers/src/processors/sequence.rs

+                for (i, encoding) in encodings.iter_mut().enumerate() {
+                    encoding.set_sequence_id(i);
+                }
+                Ok(Encoding::merge(encodings, false))


to emulate

tokenizers/tokenizers/src/tokenizer/mod.rs

Lines 123 to 125 in bb6fea0

encoding.set_sequence_id(0);
pair.set_sequence_id(1);
encoding.merge_with(pair, false);
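To make the comparison concrete, here is a minimal sketch of how the pairwise pattern above generalizes to a list of encodings. The `Encoding` type is a toy stand-in with one sequence id per token, its `merge_with` omits the `growing_offsets` flag of the real method, and `tag_and_merge` is a hypothetical helper name.

```rust
// Toy stand-in for the crate's `Encoding`: tokens plus one sequence id per token.
#[derive(Debug, Clone, PartialEq)]
pub struct Encoding {
    pub tokens: Vec<String>,
    pub sequence_ids: Vec<usize>,
}

impl Encoding {
    pub fn new(tokens: &[&str]) -> Self {
        Encoding {
            tokens: tokens.iter().map(|t| t.to_string()).collect(),
            sequence_ids: vec![0; tokens.len()],
        }
    }

    // Mark every token of this encoding as belonging to sequence `id`.
    pub fn set_sequence_id(&mut self, id: usize) {
        for s in self.sequence_ids.iter_mut() {
            *s = id;
        }
    }

    // Simplified merge: append the other encoding's tokens and ids.
    pub fn merge_with(&mut self, other: Encoding) {
        self.tokens.extend(other.tokens);
        self.sequence_ids.extend(other.sequence_ids);
    }
}

// Hypothetical helper: the i-th encoding gets sequence id i, then all are merged.
// Returns `None` for an empty list.
pub fn tag_and_merge(mut encodings: Vec<Encoding>) -> Option<Encoding> {
    for (i, encoding) in encodings.iter_mut().enumerate() {
        encoding.set_sequence_id(i);
    }
    let mut iter = encodings.into_iter();
    let mut merged = iter.next()?;
    for encoding in iter {
        merged.merge_with(encoding);
    }
    Some(merged)
}
```

For two inputs this reduces to the three lines quoted from mod.rs: the first encoding gets id 0, the pair gets id 1, and the pair is merged into the first.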

mishig25 · 2022-05-30T08:49:47Z

@Narsil please feel free to do another round of review.
The things I've changed based on our huddle on Friday are:

process_chain should not use/call process
process_chain should not be optional (i.e. every preprocessor must implement process_chain method)
template processing should err if the input (encodings vector) length is anything other than 1 or 2
added processor tests (for bytelevel, template, sequence)

add bindings implementation for composable processors
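The length rule for template processing listed above could look roughly like this. It is a sketch only: `check_template_input` is a hypothetical helper name, the error is a plain `String` rather than the crate's error type, and the function is generic so that it does not depend on the real `Encoding`.

```rust
// Hypothetical guard: template processing accepts a single encoding or a pair.
pub fn check_template_input<T>(encodings: &[T]) -> Result<(), String> {
    match encodings.len() {
        1 | 2 => Ok(()),
        n => Err(format!(
            "template processing expects 1 or 2 encodings, got {}",
            n
        )),
    }
}
```

Calling it at the top of the template processor's process_chain would reject empty input and anything longer than a pair before any template is applied.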

HuggingFaceDocBuilderDev · 2022-06-01T09:12:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Narsil

This seems like a fine direction, but don't we need to remove the code within the process function in each implementor of the trait?

tokenizers/src/pre_tokenizers/byte_level.rs

Narsil · 2022-06-01T10:24:55Z

tokenizers/src/pre_tokenizers/byte_level.rs

+                "HelloĠĠ".into(),
+                "ĠĠĠĠ".into(),
+            ],
+            vec![],


We probably want to check that type_ids and the other vecs are filled here.

(Maybe you added the tests directly within the merge function which would be better, will have to check)

mishig25 · 2022-06-01T12:08:27Z

we need to remove the code within the process

Do you mean that each process implementation should use/call process_chain?
If so, commit 78cb806 does that.

Apart from processors::Sequence::process, the process implementations for the other processors (Bert, Roberta, ByteLevel, Template) are identical:

fn process(
    &self,
    encoding: Encoding,
    pair_encoding: Option<Encoding>,
    add_special_tokens: bool,
) -> Result<Encoding> {
    let mut encodings = vec![encoding];
    if let Some(encoding) = pair_encoding {
        encodings.push(encoding);
    }

    let encodings = self.process_chain(encodings, add_special_tokens)?;

    <dyn PostProcessor>::merge_encodings(encodings)
}

Therefore, I made process on the PostProcessor trait have a default implementation here. wdyt?
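A self-contained sketch of that idea, with process as a provided trait method built on a required process_chain, might look like this. The `Encoding` type, the `String` error, the free function `merge_encodings`, and the `Upper` processor are all simplified stand-ins for illustration, not the crate's definitions.

```rust
// Toy stand-in for the crate's `Encoding`.
#[derive(Debug, Clone, PartialEq)]
pub struct Encoding {
    pub tokens: Vec<String>,
}

pub trait PostProcessor {
    // Required: every processor implements the list-based method.
    fn process_chain(
        &self,
        encodings: Vec<Encoding>,
        add_special_tokens: bool,
    ) -> Result<Vec<Encoding>, String>;

    // Provided: the single/pair entry point is written once, in terms of
    // `process_chain`, so implementors do not repeat it.
    fn process(
        &self,
        encoding: Encoding,
        pair_encoding: Option<Encoding>,
        add_special_tokens: bool,
    ) -> Result<Encoding, String> {
        let mut encodings = vec![encoding];
        if let Some(pair) = pair_encoding {
            encodings.push(pair);
        }
        let encodings = self.process_chain(encodings, add_special_tokens)?;
        merge_encodings(encodings)
    }
}

// Simplified merge: concatenate tokens in order.
fn merge_encodings(encodings: Vec<Encoding>) -> Result<Encoding, String> {
    let mut iter = encodings.into_iter();
    let mut merged = iter
        .next()
        .ok_or_else(|| "no encodings to merge".to_string())?;
    for encoding in iter {
        merged.tokens.extend(encoding.tokens);
    }
    Ok(merged)
}

// Invented processor that only implements the required method.
pub struct Upper;

impl PostProcessor for Upper {
    fn process_chain(
        &self,
        encodings: Vec<Encoding>,
        _add_special_tokens: bool,
    ) -> Result<Vec<Encoding>, String> {
        Ok(encodings
            .into_iter()
            .map(|mut e| {
                e.tokens = e.tokens.iter().map(|t| t.to_uppercase()).collect();
                e
            })
            .collect())
    }
}
```

An implementor such as `Upper` gets process for free, while a type like Sequence can still override it if its behaviour differs.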

thomasw21 · 2022-06-01T12:17:54Z

bindings/python/src/processors.rs

@@ -68,6 +70,14 @@ impl PostProcessor for PyPostProcessor {
        self.processor
            .process(encoding, pair_encoding, add_special_tokens)
    }
+
+    fn process_chain(


Does this make it available to Python? Is there a way not to do that? One of the things we vaguely discussed with @SaulLu is that we could keep this feature in Rust only, so that the day we want to break it and change the API, it doesn't count as an API-breaking change. What do you think?

cc @Narsil

mishig25 · 2022-06-01

Because of "process_chain should not be optional (i.e. every preprocessor must implement process_chain method)" and the line below

tokenizers/bindings/python/src/processors.rs

Line 59 in 34bf453

impl PostProcessor for PyPostProcessor {

Rust will not compile, with the error: PyPostProcessor does not implement process_chain

The alternative is to make implementing process_chain optional, which we don't want, based on "process_chain should not be optional (i.e. every preprocessor must implement process_chain method)"

thomasw21 · 2022-06-02

Ah, I see, thanks! I'd be interested to know whether it would make sense to build such a feature, where we can construct private abstract methods.
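The constraint discussed in this exchange can be illustrated with a small sketch: once process_chain is a required trait method, any wrapper that implements the trait has to provide it, typically by forwarding to the wrapped processor. Here `Wrapper` is a hypothetical stand-in for `PyPostProcessor`, `Identity` is an invented processor, and encodings are simplified to `Vec<String>`. As far as I understand PyO3, implementing a Rust trait on the wrapper does not by itself expose the method to Python; only methods declared in a `#[pymethods]` block are exported.

```rust
// Simplified trait: encodings are plain token lists in this sketch.
pub trait PostProcessor {
    fn process_chain(
        &self,
        encodings: Vec<Vec<String>>,
        add_special_tokens: bool,
    ) -> Result<Vec<Vec<String>>, String>;
}

// Invented processor that returns its input unchanged.
pub struct Identity;

impl PostProcessor for Identity {
    fn process_chain(
        &self,
        encodings: Vec<Vec<String>>,
        _add_special_tokens: bool,
    ) -> Result<Vec<Vec<String>>, String> {
        Ok(encodings)
    }
}

// Hypothetical stand-in for the binding-side wrapper type.
pub struct Wrapper {
    pub processor: Box<dyn PostProcessor>,
}

// Omitting this method would be a compile error, because `process_chain`
// has no default body in the trait.
impl PostProcessor for Wrapper {
    fn process_chain(
        &self,
        encodings: Vec<Vec<String>>,
        add_special_tokens: bool,
    ) -> Result<Vec<Vec<String>>, String> {
        self.processor.process_chain(encodings, add_special_tokens)
    }
}
```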

SaulLu · 2022-06-27T08:03:48Z

Hello to all!

Thanks a lot again @mishig25 for taking the lead on this feature! 🙌

I was wondering where we are on this feature! What is currently the sticking point? Is there anything we need to discuss?

* Changing `Decoder` trait to be more composable. (#938) * Changing `Decoder` trait to be more composable. Fix #872 * Fixing Python side. * Fixing test. * Updating cleanup signature, removing turbofish. * Adding `Sequence` Decoder.

…#1009) * Adding `unstable_wasm` feature + example to run `tokenizers` on wasm. Co-Authored-By: josephrocca <[email protected]> Co-Authored-By: Matthias Brunel <[email protected]> * Adding some serialization tests. * Updating with comments. Co-authored-by: josephrocca <[email protected]> Co-authored-by: Matthias Brunel <[email protected]>

Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1. - [Release notes](https://github.com/terser/terser/releases) - [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md) - [Commits](https://github.com/terser/terser/commits) --- updated-dependencies: - dependency-name: terser dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

mishig25 · 2022-08-24T08:29:50Z

Superseded by #1047

Add preprocessor::Sequence

82c64b1

mishig25 marked this pull request as draft May 27, 2022 14:52

cargo clippy

6023a7a

mishig25 requested review from Narsil, SaulLu and thomasw21 May 27, 2022 15:19

mishig25 added 7 commits May 29, 2022 22:38

process_chain impl should not use process

0a0ba68

Add byte_level test

8204b3c

clippy

bb6fea0

Add template processing test

06fd3d4

clippy

a535151

Add tests for Sequence processor

aa3a369

Refactor tests

79dc667

patrickvonplaten mentioned this pull request May 31, 2022

[GPT2Tokenizer] Raise ValueError for Fast GPT2Tokenizer with bos token for now huggingface/transformers#17498

Merged

5 tasks

Implement pythin bindgins Sequence processor

34bf453

Narsil reviewed Jun 1, 2022

update process impl to use process_chain

78cb806

Add placeholder python bindings test

9d2dbb4

thomasw21 reviewed Jun 1, 2022

mishig25 changed the title ~~Add preprocessor::Sequence~~ Add postprocessor::Sequence Jun 1, 2022

SaulLu mentioned this pull request Jun 29, 2022

Fix most of the tokenizer tests. NielsRogge/transformers#41

Merged

5 tasks

SaulLu mentioned this pull request Jul 13, 2022

Word offsets of some fast tokenizers are not compatible with token classification pipeline label aggregation huggingface/transformers#18111

Closed

4 tasks

Narsil and others added 7 commits August 23, 2022 16:00

Add from_bytes approach for creating tokenizers (#1024)

e74045e

Signed-off-by: HaoboGu <[email protected]>

Update README.md (#1019)

1bcaa34

* Update README.md Add reference to normalizer blog post * Update lib.rs * Fixing PR + clippy on node. * Update readme to match docstring. * Other clippy warning. Co-authored-by: Nicolas Patry <[email protected]>

Upgrade macro_rules_attribute to 0.1.2 (#1038)

97ca403

Adding missing node impl.

8496236

Narsil mentioned this pull request Aug 23, 2022

Modify Processor trait to support chaining. #1047

Merged

mishig25 closed this Aug 24, 2022

mishig25 deleted the processor_sequence branch August 24, 2022 08:30
