Token as returned by pipe() is relative to the sentence boundaries #53
Comments
Thanks. That should definitely be fixed. The Similarly, the
Assuming you are planning to implement this since you self-assigned (thanks!) - do you have an idea how long this will take? I'd be happy to take this on tonight, since it is required to resolve some production fallout with cargo-spellcheck's current (shit) tokenization.
Yes, I thought about a I can prioritize this, in that case I'd estimate it will land on Monday at the latest - if it's not working by then I'll just fix the In the meantime you can work around it in a semi-hacky way:

    let text = "A täst. Another test.";
    for sentence in tokenizer.pipe(text) {
        for token in sentence {
            // The sentence borrows from `text`, so the pointer difference is its byte offset.
            let offset_bytes = token.sentence.as_ptr() as usize - text.as_ptr() as usize;
            // Count the chars before the sentence to get the char offset.
            let offset_chars = text[..offset_bytes].chars().count();
            let fixed_char_span = (
                token.char_span.0 + offset_chars,
                token.char_span.1 + offset_chars,
            );
            let fixed_byte_span = (
                token.byte_span.0 + offset_bytes,
                token.byte_span.1 + offset_bytes,
            );
            println!(
                "{} {:?} {:?}",
                token.word.text.as_ref(),
                fixed_char_span,
                fixed_byte_span
            );
        }
    }
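The pointer subtraction in the workaround only gives a meaningful result if the sentence really is a subslice of the input text. As a rough sketch using only the standard library, the offset computation could be wrapped in a checked helper. The names `byte_offset_in` and `char_offset_at` are made up for illustration and are not part of nlprule:

```rust
/// Hypothetical helper: returns the byte offset of `child` inside `parent`,
/// or `None` if `child` does not lie within `parent`'s memory.
fn byte_offset_in(parent: &str, child: &str) -> Option<usize> {
    let parent_start = parent.as_ptr() as usize;
    let child_start = child.as_ptr() as usize;
    // `None` if the child starts before the parent.
    let offset = child_start.checked_sub(parent_start)?;
    // The child must also end no later than the parent does.
    let end = offset.checked_add(child.len())?;
    if end <= parent.len() {
        Some(offset)
    } else {
        None
    }
}

/// Hypothetical helper: converts a byte offset into `text` to a char offset.
/// Panics if `byte_offset` is not on a char boundary.
fn char_offset_at(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].chars().count()
}
```

For the text "A täst. Another test.", the second sentence starts at byte 9 but at char 8, because "ä" takes two bytes in UTF-8. That is why the byte and char offsets have to be tracked separately.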
It bumps the complexity quite a bit especially for a
I'll have some limited time on the weekend if you won't get around to it until then 🙃
Hi, sorry for not sticking to the timeline I gave above, I was busy this week. There's now a PR for this: #54. There are still some things to do, but I should be done soon. I'd definitely appreciate a review though (once it is ready)!
Using fn pipe() returns a set of tokens that includes spans relative to the sentence, but there seems to be no trivial way of retrieving the spans within the original text provided to pipe.
Suggestion: Use a Range<usize> instead of a tuple for the relevant range of bytes/characters for easier usage, and make it relative to the input text. For single sentences there is no change in semantics; for multi-sentence inputs there is.
It would also make sense to add the respective bounds in bytes and chars of the sentence (or replace the sentence entirely).
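To illustrate the suggestion, here is a minimal sketch of what Range-based, text-relative spans could look like. The `Span` type and the `text_relative` function are hypothetical and only show the idea; they are not the nlprule API:

```rust
use std::ops::Range;

/// Hypothetical span type: byte and char bounds stored as ranges
/// instead of `(usize, usize)` tuples.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Span {
    bytes: Range<usize>,
    chars: Range<usize>,
}

impl Span {
    /// Shifts both ranges by the given offsets.
    fn shifted(&self, byte_offset: usize, char_offset: usize) -> Span {
        Span {
            bytes: self.bytes.start + byte_offset..self.bytes.end + byte_offset,
            chars: self.chars.start + char_offset..self.chars.end + char_offset,
        }
    }
}

/// Turns a sentence-relative span into a text-relative one, given the byte
/// position at which the sentence starts inside `text`.
fn text_relative(text: &str, sentence_start: usize, in_sentence: &Span) -> Span {
    // Chars before the sentence; differs from the byte count for non-ASCII text.
    let char_offset = text[..sentence_start].chars().count();
    in_sentence.shifted(sentence_start, char_offset)
}
```

One advantage of a range over a tuple is that the byte range can index the input directly, for example `&text[span.bytes.clone()]`, without unpacking the bounds first.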
Related cargo spellcheck issue drahnr/cargo-spellcheck#162