-
Notifications
You must be signed in to change notification settings - Fork 832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add postprocessor::Sequence
#1005
Conversation
Not sure I understand why we use |
for (i, encoding) in encodings.iter_mut().enumerate() { | ||
encoding.set_sequence_id(i); | ||
} | ||
Ok(Encoding::merge(encodings, false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to emulate
tokenizers/tokenizers/src/tokenizer/mod.rs
Lines 123 to 125 in bb6fea0
encoding.set_sequence_id(0); | |
pair.set_sequence_id(1); | |
encoding.merge_with(pair, false); |
@Narsil please feel free to do another round of review.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a fine direction but we need to remove the code within the process
function in each implementor of the trait no ?
"HelloĠĠ".into(), | ||
"ĠĠĠĠ".into(), | ||
], | ||
vec![], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to check some type_ids
and other vec being filled here.
(Maybe you added the tests directly within the merge
function which would be better, will have to check)
do you mean, that except fn process(
&self,
encoding: Encoding,
pair_encoding: Option<Encoding>,
add_special_tokens: bool,
) -> Result<Encoding> {
let mut encodings = vec![encoding];
if let Some(encoding) = pair_encoding {
encodings.push(encoding);
}
let encodings = self.process_chain(encodings, add_special_tokens)?;
<dyn PostProcessor>::merge_encodings(encodings)
} therefore, made |
@@ -68,6 +70,14 @@ impl PostProcessor for PyPostProcessor { | |||
self.processor | |||
.process(encoding, pair_encoding, add_special_tokens) | |||
} | |||
|
|||
fn process_chain( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of process_chain should not be optional (i.e. every preprocessor must implement process_chain method) & the line below
impl PostProcessor for PyPostProcessor { |
rust will not compile with error: PyPostProcessor does not implement process_chain
the alternative is to make implementing process_chain
option, which we don't want based on process_chain should not be optional (i.e. every preprocessor must implement process_chain method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah so I see, thanks! Would be down to know if it would make sense to build such a feature where we can construct private abstract methods
Hello to all! Thanks a lot again @mishig25 for taking the lead on this feature! 🙌 I was wondering where we are on this feature! What is currently the sticking point? Is there anything we need to discuss? |
…#1009) * Adding `unstable_wasm` feature + example to run `tokenizers` on wasm. Co-Authored-By: josephrocca <[email protected]> Co-Authored-By: Matthias Brunel <[email protected]> * Adding some serialization tests. * Updating with comments. Co-authored-by: josephrocca <[email protected]> Co-authored-by: Matthias Brunel <[email protected]>
Signed-off-by: HaoboGu <[email protected]>
* Update README.md Add reference to normalizer blog post * Update lib.rs * Fixing PR + clippy on node. * Update readme to match docstring. * Other clippy warning. Co-authored-by: Nicolas Patry <[email protected]>
Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1. - [Release notes](https://github.com/terser/terser/releases) - [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md) - [Commits](https://github.com/terser/terser/commits) --- updated-dependencies: - dependency-name: terser dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Superceded by #1047 |
Closes #873 by implementing
preprocessor::Sequence
as suggested by @SaulLu,
Please let me know if proposed changes look good, if so, I will add tests. Otherwise, I'm happy to make any changes as necessary. Also, please see the comments as I had some questions