Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add postprocessor::Sequence #1005

Closed
wants to merge 19 commits into from
Closed

Add postprocessor::Sequence #1005

wants to merge 19 commits into from

Conversation

mishig25
Copy link
Contributor

@mishig25 mishig25 commented May 27, 2022

Closes #873 by implementing preprocessor::Sequence

as suggested by @SaulLu,

Let me just re-share here a design idea that we had - I think - also discussed offline a long time ago: add a new method to all processors - for example named process_chain - which will have another composable signature (a list) and which will be the method called by processors.Sequence. 😊

Please let me know if proposed changes look good, if so, I will add tests. Otherwise, I'm happy to make any changes as necessary. Also, please see the comments as I had some questions

@mishig25 mishig25 marked this pull request as draft May 27, 2022 14:52
@mishig25 mishig25 requested review from Narsil, SaulLu and thomasw21 May 27, 2022 15:19
@thomasw21
Copy link
Contributor

Not sure I understand why we use process_chain ... I think all fn process should take encodings: Vec<Encoding> instead, and Sequence is just a for loop that calls process instead of adding a new endpoint?

Comment on lines 53 to 56
for (i, encoding) in encodings.iter_mut().enumerate() {
encoding.set_sequence_id(i);
}
Ok(Encoding::merge(encodings, false))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to emulate

encoding.set_sequence_id(0);
pair.set_sequence_id(1);
encoding.merge_with(pair, false);

@mishig25
Copy link
Contributor Author

mishig25 commented May 30, 2022

@Narsil please feel free to do another round of review.
The things I've changed based on our huddle on friday are:

  1. process_chain should not use/call process
  2. process_chain should not be optional (i.e. every preprocessor must implement process_chain method)
  3. template processing should err if the input (encodnings vector) length is anything other than 1 or 2
  4. added tests preprocessor (for bytelevel, template, sequence)
  • add bindings implementation for composable processors

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a fine direction but we need to remove the code within the process function in each implementor of the trait no ?

tokenizers/src/pre_tokenizers/byte_level.rs Show resolved Hide resolved
"HelloĠĠ".into(),
"ĠĠĠĠ".into(),
],
vec![],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to check some type_ids and other vec being filled here.

(Maybe you added the tests directly within the merge function which would be better, will have to check)

@mishig25
Copy link
Contributor Author

mishig25 commented Jun 1, 2022

we need to remove the code within the process

do you mean, that process implementation should use/call process_chain
if so, commit 78cb806 does so

except processors::Sequence::process, process implementations for other processors (Bert, Roberta, ByteLevel, Template) are identical, which is:

fn process(
    &self,
    encoding: Encoding,
    pair_encoding: Option<Encoding>,
    add_special_tokens: bool,
) -> Result<Encoding> {
    let mut encodings = vec![encoding];
    if let Some(encoding) = pair_encoding {
        encodings.push(encoding);
    }

    let encodings = self.process_chain(encodings, add_special_tokens)?;

    <dyn PostProcessor>::merge_encodings(encodings)
}

therefore, made process of PostPorcoessor trait here have a default implementation, wdyt?

@@ -68,6 +70,14 @@ impl PostProcessor for PyPostProcessor {
self.processor
.process(encoding, pair_encoding, add_special_tokens)
}

fn process_chain(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make it available to python? Is there a way not to do that? One of the discuss we vaguely discussed with @SaulLu is that we could keep this feature only in rust so that the day we want to break it and change the API it doesn't count as a API breaking change. What do you think?

cc @Narsil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of process_chain should not be optional (i.e. every preprocessor must implement process_chain method) & the line below

impl PostProcessor for PyPostProcessor {

rust will not compile with error: PyPostProcessor does not implement process_chain

the alternative is to make implementing process_chain option, which we don't want based on process_chain should not be optional (i.e. every preprocessor must implement process_chain method)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so I see, thanks! Would be down to know if it would make sense to build such a feature where we can construct private abstract methods

@mishig25 mishig25 changed the title Add preprocessor::Sequence Add postprocessor::Sequence Jun 1, 2022
@SaulLu
Copy link
Contributor

SaulLu commented Jun 27, 2022

Hello to all!

Thanks a lot again @mishig25 for taking the lead on this feature! 🙌

I was wondering where we are on this feature! What is currently the sticking point? Is there anything we need to discuss?

Narsil and others added 7 commits August 23, 2022 16:00
* Changing `Decoder` trait to be more composable. (#938)

* Changing `Decoder` trait to be more composable.

Fix #872

* Fixing Python side.

* Fixing test.

* Updating cleanup signature, removing turbofish.

* Adding `Sequence` Decoder.
…#1009)

* Adding `unstable_wasm` feature + example to run `tokenizers` on wasm.

Co-Authored-By: josephrocca <[email protected]>
Co-Authored-By: Matthias Brunel <[email protected]>

* Adding some serialization tests.

* Updating with comments.

Co-authored-by: josephrocca <[email protected]>
Co-authored-by: Matthias Brunel <[email protected]>
* Update README.md

Add reference to normalizer blog post

* Update lib.rs

* Fixing PR + clippy on node.

* Update readme to match docstring.

* Other clippy warning.

Co-authored-by: Nicolas Patry <[email protected]>
Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1.
- [Release notes](https://github.com/terser/terser/releases)
- [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
- [Commits](https://github.com/terser/terser/commits)

---
updated-dependencies:
- dependency-name: terser
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@mishig25
Copy link
Contributor Author

mishig25 commented Aug 24, 2022

Superceded by #1047

@mishig25 mishig25 closed this Aug 24, 2022
@mishig25 mishig25 deleted the processor_sequence branch August 24, 2022 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Sequence to the processors
8 participants